A Little Goes a Long Way: June 2010

Project 1: Take a batch of MARC library records and find out if the full text of the item they refer to can be found on the Internet Archive. If so, mirror the full text locally and update the MARC record with the URLs of the local copy and the online original.

MARC is a library record format that predates ISBNs. It was designed to be used on magnetic tape, so it's not very flexible for the world of computing. However, decades of library records still use MARC, so library computers do, too. My task:

To parse MARC records (on a computer, of course, not on magnetic tape) and extract the relevant data. For our initial purposes, I extracted only the contents of the author and title fields.
To send the extracted data to the Internet Archive as a query URL and retrieve the results.
To sort the various results into three categories: no hits at the Archive, one hit, and multiple hits.
To mirror the "one-hit" results locally.
To update the MARC records with the original and local URLs of the full text.
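
The sorting step in the list above can be sketched in a few lines of Java. This is only an illustration, not the code used in the project: the class name HitClassifier, its methods, and the sample XML are all invented for the example. It also assumes that each hit appears as one <doc> element in the response, which should be checked against a real response from the Archive.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class HitClassifier {

    /** The three buckets named in the task list. */
    public enum Category { NO_HITS, ONE_HIT, MULTIPLE_HITS }

    /** Maps a raw hit count onto one of the three categories. */
    public static Category classify(int hits) {
        if (hits <= 0) {
            return Category.NO_HITS;
        }
        return hits == 1 ? Category.ONE_HIT : Category.MULTIPLE_HITS;
    }

    /**
     * Counts the <doc> elements in a search response.
     * ASSUMPTION: one <doc> element per hit; verify against a real response.
     */
    public static int countHits(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("doc").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response as XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical response body, made up for this example.
        String sample = "<response><result>"
                + "<doc><str name=\"identifier\">example-item</str></doc>"
                + "</result></response>";
        System.out.println(classify(countHits(sample)));
    }
}
```

Only records that land in the ONE_HIT bucket would go on to the mirroring step; the other two buckets need either no action or a human decision.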

Parsing MARC

This was my first encounter with MARC. I opted to program in Java, as the language I know best and which seems most flexible and intuitive to me. Turns out this was a good choice--there exists a Marc4j Java library made available by Tigris.

I also chose the Eclipse IDE ("Galileo" version) as my programming environment. Good choice again--although most of my coding experience has been on the command line or in a text editor, Eclipse is extremely flexible and easy to use. Not to mention time-saving: it's easy to navigate through code and libraries, and it has the best fully integrated debugger I've seen.

Here's my first Java program in eight years, with comments.

package iasearcher;
/**
*
*/

/**
* @author slittle2
*
* MarcRetriever 1.1 should do the following (I stopped using version numbers in comments after a while, since SVN takes care of all that)
* as long as there are unchecked records in the database:
* - keep track of its location in the database
* - retrieve the next unread MARC record from the DB
* - strip the relevant info from the record (author & title)
* - output the relevant info to a text file, w/ line breaks to separate data
* - - Two line breaks indicates new record
*
*/
Being a humanities PhD, I find comments come quite easily, as does other documentation. Must come from the writing habits formed through long hours of research and writing.

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.URLEncoder;

import org.marc4j.MarcReader;
import org.marc4j.MarcStreamReader;
import org.marc4j.marc.DataField;
import org.marc4j.marc.Record;
import org.marc4j.marc.Subfield;

public class MarcRetriever {

    /**
    * @param args
    */
    public static void main(String[] args) throws Exception {

        OutputStreamWriter out = null;
        InputStream in = null;
        int recordsRead = 0;

        try {
            in = new FileInputStream(
                    "C:/Program Files/Java/marc4j-2.4/MARCdata/notre-dame.marc");

Initially, I hard-coded my file paths into the classes. These projects taught me the value of code reusability, however, so I've since taken to loading properties files instead.
            boolean append = false;
            boolean searchRecords = true;
The searchRecords flag decides whether I'm actually loading new records or just working with the program interface. Again, today I would use a properties file for this.
            out = new FileWriter("C:/Documents and Settings/slittle2/workspace/MarcRetriever/searchkeys.txt", append);
            MarcReader reader = new MarcStreamReader(in);

            // Set default values of strings

These are actual values from a MARC record that corresponds to one of the Internet Archive's records. Because I knew they would return a positive hit, I used them as defaults, mainly to test the rest of the code.
            String author1 = "Lord, Daniel A.";
            String author2 = " ";
            String author3 = " ";
Each of these "author" strings corresponds to one subfield of the MARC author field. I extracted three subfields, so I used three different strings to hold the results.
            String title = "The story of Christmas /";
            String url = null;

Next comes a piece of test code. Now that I know the beauties of JUnit, I wouldn't do this sort of testing within the main() method of the class itself anymore. As noted above, this option doesn't load and parse any new MARC records; it just tests that the output is working.
            // Write default values to file
            if (!searchRecords) {
                url = CreateURL(author1 + author2 + author3, title);

                System.out.println(url);

                out.append(url + "\n");
            }
Here's where the actual meat of the program is.
            // Search the MARC records if desired
            else {
Unlike EAD files, which are each an individual file, MARC records can be saved to a single file collectively. So the Marc4j reader simply reads in each record sequentially and grabs the author and title fields, as indicated earlier.
                while (reader.hasNext()) {
                    Record record = reader.next();

                    try {
MARC field 100 is the author. It's broken into a number of subfields, of which I wanted only three.
First I had to retrieve the field as a whole. (Most MARC fields are data fields, and Marc4j lets you access their individual subfields through various methods. A few fields are called 'control fields' and are harder to parse.)
                        // get data field 100
                        DataField field = (DataField) record
                                .getVariableField("100");

                        // get the author proper, part 1
                        Subfield subfield;

                        try {
                            subfield = field.getSubfield('a');
                            author1 = subfield.getData();
                        } catch (NullPointerException npe) {
                            author1 = " ";
                        }

                        // get the author proper, part 2
                        try {
                            subfield = field.getSubfield('b');
                            author2 = subfield.getData();
                        } catch (NullPointerException npe) {
                            author2 = " ";
                        }

                        // get the author proper, part 3
                        try {
                            subfield = field.getSubfield('c');
                            author3 = subfield.getData();
                        } catch (NullPointerException npe) {
                            author3 = " ";
                        }
Now we do the same thing with the 'title' field, this time accessing only one subfield.
                        // get data field 245
                        field = (DataField) record.getVariableField("245");

                        try {
                            // get the title proper
                            subfield = field.getSubfield('a');
                            title = subfield.getData();
                        } catch (NullPointerException npe) {
                            title = " ";
                        }

                        // Create URL and write to file
                        url = CreateURL(author1 + author2 + author3, title);
Here I save the resulting URL and accompanying info to a tab-delimited file. Today, if I had to pass such complex data between classes, I would normally write a separate class for the custom data type and have the methods work with that. However, for our purposes it sufficed to save it to a file.

Originally I only had the program save the URL. Because later classes needed access not only to the URL but also to the author/title data, I had to go back and rewrite this method to output the author and title as well as the URL. Other classes also saved their own data, to different files. If I had to take this approach again (saving to files instead of just passing data in RAM), I would first pseudocode until I knew all of the kinds of information that I needed, and then keep a single tab-delimited file with all such info in it, which could be accessed and updated by all the classes.
                        out.append(author1 + author2 + author3 + "\t" + title + "\t" + url + "\n");

                        ++recordsRead;
I kept track of the number of records read for testing purposes.

                    } catch (NullPointerException npe) {
                        System.out.println("***NPE***");
                    }

                } // end while
            } // end if
        } finally {

            // Close input/output streams - One of the most important things I learned: Always close your streams. Except for the System ones.
            if (out!=null) out.close();
            if (in!=null) in.close();
            System.out.println("Total records read: " + recordsRead);
        }


    }
It was simple enough to query the Internet Archive a few times and figure out the appropriate format for their query URLs. Then I just broke the URL apart and saved each piece in a separate string, to be assembled later on into a workable query.
    // Creates a URL to return with the proper format for the Internet Archive
    private static String CreateURL(String author, String title) throws Exception {
        String url = null;
        String url1 = "http://www.archive.org/advancedsearch.php?q=";
        String url2 = "title:("; // + title +
        String url3 = ") AND creator:("; // + author +
        String url4 = ")&fl[]=identifier&sort[]=&sort[]=&sort[]=&rows=50&page=1&callback=callback&output=xml";
Note that the returned info is in XML.

        // Create final url to return
        url = url2 + title + url3 + author;

        url = URLEncoder.encode(url, "UTF-8");

        url = url1 + url + url4;

        return url;
    }

}
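
For the properties-file approach mentioned in the commentary above, a minimal sketch follows. It is not the code from the project: the class name RetrieverConfig, the property keys (marc.file, output.file, append, search.records), and the default values are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class RetrieverConfig {

    // Hypothetical settings that replace the hard-coded paths and flags.
    public final String marcFile;
    public final String outputFile;
    public final boolean append;
    public final boolean searchRecords;

    private RetrieverConfig(Properties p) {
        // The second argument is the default used when a key is missing.
        marcFile = p.getProperty("marc.file", "records.marc");
        outputFile = p.getProperty("output.file", "searchkeys.txt");
        append = Boolean.parseBoolean(p.getProperty("append", "false"));
        searchRecords = Boolean.parseBoolean(p.getProperty("search.records", "true"));
    }

    /** Reads key=value pairs from any Reader (a file, a string, and so on). */
    public static RetrieverConfig load(Reader source) {
        Properties p = new Properties();
        try {
            p.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RetrieverConfig(p);
    }

    public static void main(String[] args) {
        // Inline text stands in for a properties file so the example is self-contained.
        String text = "marc.file=data/notre-dame.marc\nsearch.records=false\n";
        RetrieverConfig config = load(new StringReader(text));
        System.out.println(config.marcFile + " searchRecords=" + config.searchRecords);
    }
}
```

In real use, the Reader would come from something like new FileReader("retriever.properties") (a hypothetical file name). Forward slashes are the safer choice for Windows paths in such a file, since a backslash is an escape character in the properties format.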

Wednesday, June 30, 2010

Find the Full Text
