Friday, July 30, 2010

A new job! - and - Meet Textile!

So, at long last, I have a full-time (temp) job! I'll be helping with IT at a university press. :-)

I haven't started yet, but I've been hanging around to learn more about the press and my responsibilities therein. My job will in large part consist of overseeing a MySQL database/FileMaker Pro installation and doing tech support for my coworkers.

Today, I got my hands dirty in the press for the first time. It was also my introduction to Textile. Let me explain.

My coworkers use Textile to code their website. I've never used it before. But thanks to a good search engine (Google) and Wikipedia, I quickly tracked down the solution to a problem plaguing a coworker in Marketing.

Problem: With perfect syntax, a link was nonetheless not showing up as a link.
Solution: She had to change smart-quotes to straight-quotes. Suddenly, it worked fine!
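The fix amounts to replacing the Unicode "smart" quote code points with plain ASCII quotes before the text reaches Textile. A minimal sketch in Java (the class and method names here are my own, not anything from FileMaker or Textile):

```java
public class QuoteFixer {
    // Replace Unicode smart quotes with straight ASCII quotes so that
    // Textile recognizes link syntax like "text":http://example.com
    public static String straighten(String s) {
        return s.replace('\u201C', '"')   // left double quote
                .replace('\u201D', '"')   // right double quote
                .replace('\u2018', '\'')  // left single quote
                .replace('\u2019', '\''); // right single quote
    }

    public static void main(String[] args) {
        String textile = "\u201Chttp://www.google.com\u201D:http://www.google.com";
        System.out.println(straighten(textile));
    }
}
```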

What's more interesting to me are the tools I found along the way, like the Full Syntax Reference for Textile. The really nifty part of that site is its Textile translator. Suppose you enter

"http://www.google.com":http://www.google.com

The translator converts this into XHTML...

    <p><a href="http://www.google.com">http://www.google.com</a></p>

and shows how it will appear on a webpage!

http://www.google.com 

Voilà!

I ALSO learned that FileMaker Pro 10 moved its "Toggle Smart Quotes" setting out of Preferences into File Options (under the File Menu).

Unresolved: Even though the Smart Quotes option was turned off on my coworker's machine, it was still putting them in! Go figure...

Thursday, July 22, 2010

Project 1 - Displaying the Results

This class is unfinished. It was to display the results of the previous operations in an HTML browser window, using the default browser of the user's computer. But as it turned out, so few results were returned from the Internet Archive--and so prohibitive would it have been to sift the good results from false positives--that the project was scrapped.

Upside: I saved my department time and money by showing that this wasn't worth pursuing!


/**
 *
 */
package iasearcher;

import java.io.*;
import java.util.*;

/**
 * @author slittle2
 *
 *    IAResults.java 1.0 displays an HTML page with the results of
 *    the IASearcher.java search. It divides the results into three
 *    parts: No hits, one hit, and multiple hits. Each gives the title
 *    and author of the work, plus links to search the IA full text
 *    and the CRRA record so that the user can compare the two. It also
 *    prints the IA "key" next to the links. The multiple results page
 *    displays the multiple results in a subordinate list.
 *
 *    To do in the future: Generate HTML and open without having to save
 *    to a file (unless having the HTML results is desirable?).
 *
 *    Uses http://java.sun.com/javase/6/docs/api/java/awt/Desktop.html#browse%28java.net.URI%29
 *
 */
public class IAResults{

    /**
     * @param args
     */
    public static void main(String[] args) throws IOException {
       
        /* Web page should look like this:
         *
         *  Report 1 - No hits
         *      * Title/Author [(Search IA) (Search CRRA) ?]
         *      ...
         * 
         *  Report 2 - 1 hit
         *      * Title/Author (Search IA) (Search CRRA) (Key)
         *      ...
         * 
         *  Report 3 - Multiple hits
         *      * Title/Author
         *          * (Search IA) (Search CRRA) (Key)
         *          * (Search IA) (Search CRRA) (Key)
         *      ...
         * 
         * 
         * Basic code:
         *
         *     <html>
         *
         *     <body>
         *
         *     <h1>Report 1 - No hits</h1>
         *     <ul>
         *         <li>Title/Author</li>
         *         ... (more results)
         *     </ul>
         *
         *     <h1>Report 2 - 1 hit</h1>
         *     <ul>
         *         <li>Title/Author (Search IA) (Search CRRA) (Key)</li>
         *         ... (more results)
         *     </ul>
         *
         *     <h1>Report 3 - Multiple hits</h1>
         *     <ul>
         *         <li>Title/Author
         *             <ul>
         *                 <li>(Search IA) (Search CRRA) (Key)</li>
         *                 ... (more results)
         *             </ul>
         *         </li>
         *         ... (more results)
         *     </ul>
         *
         *     </body>
         *
         *     </html>
         * 
        */
       
        // Initialize variables
        BufferedReader noResultsFile = null;
        BufferedReader oneResultFile = null;
        BufferedReader manyResultsFile = null;
       
        LinkedHashSet<String> noResultsSet = new LinkedHashSet<String>(); // Sets to import the results data into
        LinkedHashSet<String> oneResultSet = new LinkedHashSet<String>();
        LinkedHashSet<String> manyResultsSet = new LinkedHashSet<String>();
       
        String data = " "; // Generic variable used for reading Strings
       
       
        // Open files and load results into appropriate sets
        try {
            noResultsFile = new BufferedReader((Reader) new FileReader("C:/Documents and Settings/slittle2/workspace/MarcRetriever/noResults.txt"));
            oneResultFile = new BufferedReader((Reader) new FileReader("C:/Documents and Settings/slittle2/workspace/MarcRetriever/oneResult.txt"));
            manyResultsFile = new BufferedReader((Reader) new FileReader("C:/Documents and Settings/slittle2/workspace/MarcRetriever/manyResults.txt"));
       

           
            while ((data = noResultsFile.readLine()) != null) {
                noResultsSet.add(data);
            }
           
            while ((data = oneResultFile.readLine()) != null) {
                oneResultSet.add(data);
            }
           
            while ((data = manyResultsFile.readLine()) != null) {
                manyResultsSet.add(data);
            }
           
            // System.out.println(noResultsSet.toString()); TODO remove test code
            // System.out.println(oneResultSet.toString());
            // System.out.println(manyResultsSet.toString());
           
        }catch (FileNotFoundException e){
            System.err.println("*** File Not Found ***");
            e.printStackTrace();
        }finally{
            if(noResultsFile != null) noResultsFile.close();
            if(oneResultFile != null) oneResultFile.close();
            if(manyResultsFile != null) manyResultsFile.close();
        }
       
        // Output strings into a single HTML file
       
        Iterator iter = noResultsSet.iterator(); // TODO find author/title pairs
        while(iter.hasNext()){
            data = (String) iter.next();
            System.out.println(data); // TODO remove test code
        }
       
        iter = oneResultSet.iterator(); // TODO find author/title pairs
        while(iter.hasNext()){
            data = (String) iter.next();
            System.out.println(data); // TODO remove test code
        }
       
        iter = manyResultsSet.iterator(); // TODO find author/title pairs; break down strings into substrings
        while(iter.hasNext()){
            data = (String) iter.next();
            System.out.println(data); // TODO remove test code
        }
       
        // Open HTML file with .awt.Desktop class

    }

}
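Had the project continued, the missing HTML step might have looked something like this: build the page as a string, write it to a file, and hand it to java.awt.Desktop as the javadoc's "to do" note suggests. This is my own reconstruction, not code from the project; the class, method names, and report contents are placeholders.

```java
import java.awt.Desktop;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class ReportWriter {

    // Build the three-part report page described in the IAResults comments.
    static String buildHtml(List<String> none, List<String> one, List<String> many) {
        StringBuilder sb = new StringBuilder("<html><body>");
        appendReport(sb, "Report 1 - No hits", none);
        appendReport(sb, "Report 2 - 1 hit", one);
        appendReport(sb, "Report 3 - Multiple hits", many);
        return sb.append("</body></html>").toString();
    }

    // One <h1> heading followed by a bulleted list of result lines.
    private static void appendReport(StringBuilder sb, String title, List<String> items) {
        sb.append("<h1>").append(title).append("</h1><ul>");
        for (String item : items) sb.append("<li>").append(item).append("</li>");
        sb.append("</ul>");
    }

    // Write the page to disk, then open it in the default browser if possible.
    static void show(String html, File out) throws IOException {
        try (FileWriter w = new FileWriter(out)) {
            w.write(html);
        }
        if (Desktop.isDesktopSupported()
                && Desktop.getDesktop().isSupported(Desktop.Action.BROWSE)) {
            Desktop.getDesktop().browse(out.toURI());
        }
    }
}
```

The Desktop call is guarded so the same code runs harmlessly on headless machines, where only the file gets written.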

Project 1 - Updating the MARC Records

Once the Internet Archive's documents were mirrored locally, I had to add the local and IA URLs to the MARC records. In practice, since I was using a file and not directly accessing the MARC database, I saved the revised records to a new file, which could then be added to the database.

/**
 *
 */
package iasearcher;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Reader;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;
import java.util.LinkedHashSet;

import org.marc4j.MarcReader;
import org.marc4j.MarcStreamReader;
import org.marc4j.MarcStreamWriter;
import org.marc4j.MarcWriter;
import org.marc4j.marc.DataField;
import org.marc4j.marc.MarcFactory;
import org.marc4j.marc.Record;
import org.marc4j.marc.Subfield;

/**
 * @author slittle2
 *
 * Once files have been retrieved from the Internet Archive,
 * UpdateMarc updates the MARC records with two things:
 *
 *     - the URL of the IA directory
 *     - the URL of the local copy
 *
 * Each is saved into a new 856 field, subfield u.
 *
 */
public class UpdateMarc {

    /**
     * @param args
     * @throws IOException
     */
   
    // Here for testing purposes
    public static void main(String[] args) throws IOException {

        // Values for test run - may be changed as needed
        String marcFile = "C:/Documents and Settings/slittle2/Desktop/updated.marc";
        String tempFile = "C:/Documents and Settings/slittle2/Desktop/temp.marc";
        String oneHitLog = "C:/Documents and Settings/slittle2/workspace/MarcRetriever/Success Files 5-26/oneResult.txt";
       
        updater(marcFile, oneHitLog, tempFile);
       
    }

    public static void updater(String marcFile, String oneHitLog, String tempFile)
            throws IOException {
        LinkedHashSet<KeyDatum> keyData = searchKATIL(oneHitLog);

        boolean append = true;

        // Find and update the appropriate MARC record:

        // Open MARC database
        InputStream in = null;
        OutputStream out = null;

        try {
            in = new FileInputStream(marcFile);
            out = new FileOutputStream(tempFile, append);
            MarcReader reader = new MarcStreamReader(in);
            MarcWriter writer = new MarcStreamWriter(out);

            // While iterator.hasNext(), search the MARC records for all
            // matching author/title

            while (reader.hasNext()) {
                Record record = reader.next();
               
                String author = "";
                String title = "";
                // Create iterator over keyData
                Iterator<KeyDatum> iter = keyData.iterator();

                // Match current record author/title against entire keyData list
                author = getFullAuthor(record);
                title = getTitle(record);

                while(iter.hasNext()){
                   
                    KeyDatum datum = (KeyDatum) iter.next();
                   
                    // If found:
                    // Add 856$U w/ $Z "Original Location" & IA URL
                    // Add 856$U w/ $Z "Local Mirror" & local URL
                    if(author.equalsIgnoreCase(datum.author) && title.equalsIgnoreCase(datum.title)){
                        System.out.println("It matches!\t" + record);
                       
                        // add a data field for IA URL
                        MarcFactory factory = MarcFactory.newInstance();
                        DataField df = factory.newDataField("856", '0', '4');
                        df.addSubfield(factory.newSubfield('u', datum.iaURL));
                        df.addSubfield(factory.newSubfield('z', "ORIGINAL LOCATION"));
                        record.addVariableField(df);
                       
                        // add another data field for local URL
                        DataField dq = factory.newDataField("856", '0', '4');
                        dq.addSubfield(factory.newSubfield('u', datum.localURL));
                        dq.addSubfield(factory.newSubfield('z', "LOCAL MIRROR"));
                        record.addVariableField(dq);

                        writer.write(record);
                       
                        System.out.println("Updated Record:\t" + record);
                       
                        break;
                    }
                   

                } // end while
               
            } // end while
           
            writer.close();
           
        } finally {

            // Close input/output streams
            if (out != null)
                out.close();
            if (in != null)
                in.close();
        }

    }

    private static String getTitle(Record record) {
        // get data field 245
        DataField field = (DataField) record.getVariableField("245");
        Subfield subfield;
        String title = "";
       
        try {
            // get the title proper
            subfield = field.getSubfield('a');
            title = subfield.getData();
        } catch (NullPointerException npe) {
            title = " ";
        }
       
        return title;
    }

    private static String getFullAuthor(Record record) {

        String author1 = "";
        String author2 = "";
        String author3 = "";
       
        // get data field 100
        DataField field = (DataField) record
                .getVariableField("100");

        // get the author proper, part 1
        Subfield subfield;

        try {
            subfield = field.getSubfield('a');
            author1 = subfield.getData();
        } catch (NullPointerException npe) {
            author1 = " ";
        }

        // get the author proper, part 2
        try {
            subfield = field.getSubfield('b');
            author2 = subfield.getData();
        } catch (NullPointerException npe) {
            author2 = " ";
        }

        // get the author proper, part 3
        try {
            subfield = field.getSubfield('c');
            author3 = subfield.getData();
        } catch (NullPointerException npe) {
            author3 = " ";
        }
        return author1 + author2 + author3;
    }

    // Gets the Key, Author, Title, and IA & Local URL
    private static LinkedHashSet<KeyDatum> searchKATIL(String oneHitLog) throws IOException {

        LinkedHashSet<KeyDatum> kati = new LinkedHashSet<KeyDatum>();
        LinkedHashSet<KeyDatum> previous = new LinkedHashSet<KeyDatum>();
       
        // Open file
        BufferedReader inFile = null; // create a new stream to open a file
        BufferedReader inFile2 = null;
        BufferedWriter outFile = null;
        final String addressRoot = "http://www.archive.org/download/";
        final String localRoot = "http://zoia.library.nd.edu//sandbox/books";
        final String outFileLocation = "C:/Documents and Settings/slittle2/Desktop/outFile.txt";

        try {
            inFile = new BufferedReader((Reader) new FileReader(oneHitLog));
            inFile2 = new BufferedReader ((Reader) new FileReader(outFileLocation));
            String data = " ";
            String data2 = " ";
            boolean old = true; // This is true because all the results should be stored in a local file now.
           
            // Load previous results into memory
            while((data2 = inFile2.readLine()) != null) {
                String[] splitData2 = data2.split("\t");
                previous.add(new KeyDatum(splitData2[0],splitData2[1],splitData2[2],splitData2[3],splitData2[4]));
            }
            inFile2.close();
           
            outFile = new BufferedWriter((Writer) new FileWriter(outFileLocation, true));
           
            // Retrieve URLs from file & send to Internet Archive
            while ((data = inFile.readLine()) != null) {

                // Extract keys
                String[] splitData = data.split("\t");

                // Load each Key, Author, Title into a KeyDatum; leave other two
                // blank
                KeyDatum keyDatum = new KeyDatum(splitData[2], splitData[0],
                        splitData[1], "", "");

                // Check and see if already in previous results
                Iterator<KeyDatum> iter = previous.iterator();
                while (iter.hasNext()) {
                    KeyDatum next = iter.next();
                    if (keyDatum.compareQuick(next)) {
                        old = true;
                        kati.add(next);
                        break;
                    }
                }

                if (!old) {
                    // Generate IA URL
                    keyDatum.iaURL = addressRoot + keyDatum.key + "/";

                    // Generate local URL
                    data = (keyDatum.iaURL).toString();
                    data = redirectAndTrim(data);
                    keyDatum.localURL = data.replaceFirst("http:/", localRoot);

                    outFile.append(keyDatum.toString("\t"));
                // Adds the new KeyDatum to the LHS
                kati.add(keyDatum);
                System.out.println(keyDatum.toString("\t"));
                }

            }

        } catch (MalformedURLException e) {
            System.err.println("*** Malformed URL Exception ***");
        } catch (FileNotFoundException e) {
            System.err.println("*** File not found! ***");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("*** IO Exception ***");
            e.printStackTrace();
        } finally {
            if (inFile != null)
                inFile.close();
            if (outFile != null)
                outFile.close();
        }
       
        return kati;
    }

    // TODO Can't I just use the one in IASearcher?
    protected static String redirectAndTrim(String key) throws IOException {
        // Retrieve the redirected URL from IA

        URI uri = null;
        URL url = null;
        InputStream inURI = null;
        String newURL = "";

        try {
           
            // Open connection to IA
            uri = new URI(key);

            url = uri.toURL();
           
           
            URLConnection yc = url.openConnection();
            HttpURLConnection h = (HttpURLConnection) yc;
            HttpURLConnection.setFollowRedirects(true);
            h.getInputStream(); // Necessary to force redirect!
            newURL = h.getURL().toString();
           
            return newURL;
           
            // Catching errors
        } catch (URISyntaxException e) {
            System.err.println("*** URI Syntax Exception ***");
            e.printStackTrace();
        } catch (MalformedURLException e) {
            System.err.println("*** Malformed URL Exception ***");
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            System.err.println("*** File not found! ***");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("*** IO Exception ***");
            e.printStackTrace();
        } finally {
            if (inURI != null)
                inURI.close();
        }

        return null;
    }

}
   
   
// Class for handling the various kinds of data
// Each key maps to 1 each of: author, title, IA URL, & local URL
class KeyDatum {

    protected String key;
    protected String author;
    protected String title;
    protected String iaURL;
    protected String localURL;

    KeyDatum() {
        key = "";
        author = "";
        title = "";
        iaURL = "";
        localURL = "";
    }


    KeyDatum(String k, String a, String t, String i, String l) {
        key = k;
        author = a;
        title = t;
        iaURL = i;
        localURL = l;
    }

    // Returns all fields as a single string separated by the passed delimiter (e.g. \n or \t)
    public String toString(String c){
        return key + c + author + c + title + c + iaURL + c + localURL + c + "\n";
    }
   
    public boolean compare(KeyDatum datum){
        return this.key.equalsIgnoreCase(datum.key) &&
                this.author.equalsIgnoreCase(datum.author) &&
                this.title.equalsIgnoreCase(datum.title) &&
                this.iaURL.equalsIgnoreCase(datum.iaURL) &&
                this.localURL.equalsIgnoreCase(datum.localURL);
    }

    public boolean compareQuick(KeyDatum datum) {
        return this.key.equalsIgnoreCase(datum.key);
    }
}

Project 1 - Parsing the XML File

Returning to Project 1: this class parsed the resulting XML file.

package iasearcher;
/**
 *
 */

/**
 * @author slittle2
 *
 *    This class contains code modified from http://www.exampledepot.com/egs/javax.xml.parsers/BasicSax.html
 *
 *   
 *
 */

import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class IASearcherXMLReader {
   
    private static int recordQuantity = 0;
    private static String identifiers = "";
    private static boolean grabCharacters = false;
    private static boolean failFlag = false;

    public static void main(String[] args) { // Create a handler to handle the SAX events generated during parsing
        parse();
    }

    // Method called by main or another class to parse an XML file
    public static void parse() {
       
        // Re-initialize variables -- necessary!
        recordQuantity = 0;
        identifiers = "";
        grabCharacters = false;
        failFlag = false;
       
        DefaultHandler handler = new XMLHandler(); // Parse the file using the handler
        parseXmlFile("C:/Documents and Settings/slittle2/workspace/MarcRetriever/output.xml", handler, false);
       
    }
   
    // To get the number of records found
    public static int getRecordQuantity(){
        return recordQuantity;
    }
   
    // To get the String of identifiers
    public static String getIdentifiers() {
        return identifiers;
    }
   
    // To return whether the parse succeeded
    public static boolean getFailFlag() {
        return failFlag;
    }
   
    // DefaultHandler contains no-op implementations for all SAX events.
    // This class should override methods to capture the events of interest.
    static class XMLHandler extends DefaultHandler {
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            if(qName.equals("result")) { // "numFound" attribute is second, i.e. "1"
               
                recordQuantity = Integer.parseInt(attributes.getValue("numFound"));
            }

            try {
                if (attributes.getValue(0).equals("identifier")) {
                    grabCharacters = true;
                }
            } catch (NullPointerException npe) {
                // No attributes present! Move along!
            }
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (grabCharacters) {
                identifiers += new String(ch, start, length) + "\t";
                grabCharacters = false;
            }
        }

    }
   
    // Parses an XML file using a SAX parser.
    // If validating is true, the contents are validated against the DTD
    // specified in the file.
    public static void parseXmlFile(String filename, DefaultHandler handler, boolean validating) {
        try { // Create a builder factory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating); // Create the builder and parse the file
            factory.newSAXParser().parse(new File(filename), handler);
           
           
        } catch (SAXException e) { // A parsing error occurred; the xml input is not valid
            System.err.println("*** SAX Exception ***");
            failFlag = true;
        } catch (ParserConfigurationException e) {
            System.err.println("*** Parser Configuration Exception ***");
            failFlag = true;
        } catch (IOException e) {
            System.err.println("*** IO Exception ***");
            failFlag = true;
        } // End try-catch
    }
   
}

Wednesday, July 21, 2010

Project 2 - JUnit

Well, it isn't very fancy, but it provided a decent intro to JUnit testing! :-)


/**
 *
 */
package crrasolrindexer;

import static org.junit.Assert.*;

import java.io.IOException;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

/**
 * @author slittle2
 *
 */
public class CRRA_DatumTest {

    private CRRA_Datum testCD = null;
    CRRA_Datum crraCD = null;
   
    /**
     * @throws java.lang.Exception
     */
    @Before
    public void setUp() throws Exception {
        crraCD = new CRRA_Datum("id allfields institution collection building language format author author-letter authorStr auth_author auth_authorStr title title_sort title_sub title_short title_full title_fullStr title_auth physical publisher publisherStr publishDate edition description contents url thumbnail lccn ctrlnum isbn issn callnumber callnumber-a callnumber-first callnumber-first-code callnumber-subject callnumber-subject-code callnumber-label dewey-hundreds dewey-tens dewey-ones dewey-full dewey-sort author2 author2Str author2-role auth_author2 auth_author2Str author_additional author_additionalStr title_alt title_old title_new dateSpan series series2 topic genre geographic illustrated recordtype");
    }

    /**
     * @throws java.lang.Exception
     */
    @After
    public void tearDown() throws Exception {
        crraCD = null;
    }

    /**
     * Test method for {@link crrasolrindexer.CRRA_Datum#CRRA_Datum(java.lang.String)}.
     */
    @Test
    public final void testCRRA_DatumString() {
        testCD = new CRRA_Datum("foo bar");
        assertNotNull(testCD);
    }

    /**
     * Test method for {@link crrasolrindexer.CRRA_Datum#CRRA_Datum()}.
     */
    @Test
    public final void testCRRA_Datum() {
        testCD = new CRRA_Datum();
        assertNotNull(testCD);
        assertEquals(testCD.toString(), crraCD.toString());
        // Note: CRRA_Datum does not override equals(), so an object-equality
        // assertion would only compare references; the toString() check above suffices.
    }

    /**
     * Test method for {@link crrasolrindexer.CRRA_Datum#returnField(java.lang.String)}.
     * @throws IOException
     */
    @Test
    public final void testReturnField() throws IOException {
        testCD = new CRRA_Datum("foo bar");
        testCD.setField("foo", "boo!");
        assertEquals("boo!", testCD.returnField("foo"));
    }

    /**
     * Test method for {@link crrasolrindexer.CRRA_Datum#concatenateField(java.lang.String, java.lang.String)}.
     * @throws IOException
     */
    @Test
    public final void testConcatenateField() throws IOException {
        crraCD.setField("author", "Bob Shakespeare");
        crraCD.concatenateField("author", " & Joe Shakespeare");
        assertEquals("Bob Shakespeare & Joe Shakespeare", crraCD.returnField("author"));
    }

    /**
     * Test method for {@link crrasolrindexer.CRRA_Datum#toString()}.
     */
    @Test
    public final void testToString() {
        assertNotNull(crraCD.toString());
    }

}

Project 2 - Enhanced Data Class

The class just posted required an enhanced data class to model the VuFind records. Here it is!

/**
 *
 */
package crrasolrindexer;

import java.io.*;
import java.util.Iterator;
import java.util.LinkedHashSet;

/**
 * @author slittle2
 *
 *    The CRRA_Datum class is just like the IndexDatum class, but modified to work
 *    with the CRRA schema (or any other). It holds the kinds of data that are to be extracted
 *    from an indexed field, whether MARC or EAD (or anything else). All fields
 *    are private; only a protected method allows setting them, and a public
 *    method allows retrieving their contents. (The setField() method is
 *    protected so that one has to change the data through the appropriate
 *    class that indexes the data.)
 *
 *    This class is designed to be easily extensible for use with other kinds
 *    of data. UNLIKE IndexDatum, it may be passed as string of schema names from
 *    which it builds a new schema for its entries.
 *
 *    Note that the schema names used here are independent of the schema *map* that
 *    the CRRA_EADRetriever class uses to map from EAD to VuFind. If inconsistencies
 *    occur between the schema here and the schema_map (or between these and the actual
 *    VuFind schema), unpredictable behavior may result.
 *
 */
public class CRRA_Datum {

    private class Entry {
        String name;
        String content;
      
        Entry() {
            name = "";
            content = "";
        }
      
        Entry(String n, String c){
            name = n;
            content = c;
        }
      
    }
   
    // The default schema is that used in Vufind as of this coding (June 2010).
    private String schema_names = "id fullrecord allfields institution collection building language format author author-letter authorStr auth_author auth_authorStr title title_sort title_sub title_short title_full title_fullStr title_auth physical publisher publisherStr publishDate edition description contents url thumbnail lccn ctrlnum isbn issn callnumber callnumber-a callnumber-first callnumber-first-code callnumber-subject callnumber-subject-code callnumber-label dewey-hundreds dewey-tens dewey-ones dewey-full dewey-sort author2 author2Str author2-role auth_author2 auth_author2Str author_additional author_additionalStr title_alt title_old title_new dateSpan series series2 topic genre geographic illustrated recordtype";
    private LinkedHashSet<Entry> entries = null;
   
    // Pass a string containing schema field names separated by a ' '.
    public CRRA_Datum(String schema_names) {
      
        this.schema_names = schema_names; // override the default schema, so returnSchemaNames() stays accurate
        entries = new LinkedHashSet<Entry>();
      
        String[] schema = schema_names.split(" ");
      
        for(int i = 0; i < schema.length; i++){
            entries.add(new Entry(schema[i], ""));
        }
      
    }
   
    // Default constructor uses the current (2010) VuFind schema names
    public CRRA_Datum() {
      
        entries = new LinkedHashSet<Entry>();
      
        String[] schema = schema_names.split(" ");
      
        for(int i = 0; i < schema.length; i++){
            entries.add(new Entry(schema[i], ""));
        }
      
    }
   
    // Return the names of the schema fields as a single string. This can then be parsed/tokenized as needed.
    public String returnSchemaNames(){
        return schema_names;
    }
   
    // Return a given field's value.
    public String returnField(String fieldName) throws IOException {
   
        Iterator<Entry> iter = entries.iterator();
      
        while(iter.hasNext()){
            Entry entry = (Entry) iter.next();
            if(entry.name.equalsIgnoreCase(fieldName)){
                return entry.content;
            }
        }
        throw new IOException();
      
    }
   
    // Set the value of a given field. Completely overwrites the original.
    protected void setField(String fieldName, String data) throws IOException {
      
        Iterator<Entry> iter = entries.iterator();
      
        while(iter.hasNext()){
            Entry entry = (Entry) iter.next();
            if(entry.name.equalsIgnoreCase(fieldName)){
                entry.content = data;
                return;
            }
        }
        throw new IOException();
      
    }
   
    // Adds data to a field without overwriting it.
    protected void concatenateField(String fieldName, String data) throws IOException {

        Iterator<Entry> iter = entries.iterator();
      
        while(iter.hasNext()){
            Entry entry = (Entry) iter.next();
            if(entry.name.equalsIgnoreCase(fieldName)){
                entry.content += data;
                return;
            }
        }
        throw new IOException();
      
    }
   
    // Displays the fields/values of the entire Datum.
    public String toString(){
        String contents = "";

        Iterator<Entry> iter = entries.iterator();
      
        while(iter.hasNext()){
            Entry entry = iter.next();

            contents += "\n " + entry.name + "\t\t"+ entry.content;
          
        }
      
        return contents;
    }
   
}
 

Project 2 - Beefing Up EAD File-Handling

This class is far more powerful than its predecessor, EADDataRetriever. Added functionality includes sending data to multiple VuFind fields at once, handling multiple kinds of schema files and EAD-VuFind crosswalks, and better documentation!
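For reference, here is a hypothetical example of the crosswalk file that eadLoader() expects (the field and tag names below are made up; the format is inferred from the parsing code: each line starts with a VuFind field name, followed by the space-separated EAD tags that feed it, and a line reading strictElementPaths switches on strict path matching):

```text
title ead archdesc did unittitle
author ead archdesc did origination persname
publishDate ead archdesc did unitdate
strictElementPaths
```

A presets file, if used, simply pairs a field name with a constant value on each line, e.g. `institution Our University Press` (again hypothetical), since the code splits each line into the first word and the remainder.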


package crrasolrindexer;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.Set;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.apache.solr.client.solrj.SolrServerException;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * @author slittle2
 *
 *         CRRA_EADRetriever does the same thing as EADDataRetriever, except
 *         using the new CRRA_Datum class.
 *
 *         To this end, it cycles through the files in the given directory,
 *         parses each, extracts the relevant data, and puts it in a
 *         LinkedHashSet<CRRA_Datum>. Voila'!
 *
 */

public class CRRA_EADRetriever {

    /**
     * @param args
     */

    // This is used only in testing; if called from TextUICRRASI, it uses whatever is in the given properties file.
    private static String testPathName = "C:/Documents and Settings/slittle2/Desktop/Index Data/ead/xml/";

    // The set of records to send to the Indexer
    private static LinkedHashSet<CRRA_Datum> eadRecords = new LinkedHashSet<CRRA_Datum>();
    private static int recordQuantity = 0; // The quantity of EAD records parsed
    private static boolean grabCharacters = false; // Whether or not to stuff data into fields besides "allfields"
    private static boolean failFlag = false; // Whether or not the parsing operation succeeded

    private static CRRA_Datum datum = new CRRA_Datum(); // Using default schema here

    // The user-defined mapping from EAD to VuFind will be stored here
    private static LinkedHashMap<LinkedHashSet<String>, String> schema_map = new LinkedHashMap<LinkedHashSet<String>, String>();

    private static LinkedList<String> currentFieldSet = new LinkedList<String>(); // for keeping track of the current fields to send data to
    private static LinkedList<String> tagStack = new LinkedList<String>(); // for keeping track of tag nesting while parsing

    // fieldName sets the field in datum that the parser passes data to,
    // and then datum is sent to eadRecords
    private static String fieldName = "";

    // Name for file with schema
    public static String schema_filename = "";
   
    // Whether to force parser to evaluate whether the tagStack and a given element "path" are identical before passing data
    public static boolean strictElementPaths = false;

    // Name for file with schema presets
    public static String schema_presets = "";

    // Stores the presets to add to each record right before moving on to the next parsing
    private static LinkedHashMap<String, String> presets_map = new LinkedHashMap<String, String>();
   
    // Main() included more or less for testing purposes
    public static void main(String[] args) throws IOException, SolrServerException {
        eadLoader(testPathName);
       
        System.out.println("Number of records = " + eadRecords.size());
       
        Iterator<CRRA_Datum> iter = eadRecords.iterator();
        while (iter.hasNext()) {
            datum = iter.next();
           
            System.out.println(datum.toString());

        }
       
        Indexer.indexCD(eadRecords, "http://localhost:8983/solr/core0/");
       
        System.out.println("Successfully indexed... we hope.");
       
    }

    // Cycling through the files in the directory and loading each in turn
    public static LinkedHashSet<CRRA_Datum> eadLoader(String pathname) throws IOException, SolrServerException {

        String filename = "";

        // Initialize variables -- must be cleared every time parser is run!
        recordQuantity = 0;
        failFlag = false;
        datum = new CRRA_Datum();
        eadRecords = new LinkedHashSet<CRRA_Datum>();

        // Read in schema file here, from filename indicated above.
        // MUST set schema_filename before calling!
        if(schema_filename.equalsIgnoreCase("")) throw new IOException("schema_filename must be set before calling eadLoader()");
       
        // Open schema_map file
        BufferedReader inFile = null; // create a new stream to open a file
        try {
            inFile = new BufferedReader((Reader) new FileReader(schema_filename));
            String data = " ";
           
            // Read in each line until termination
            while ((data = inFile.readLine()) != null) {
                // Split into the first word (the VuFind field name) and the rest of the line
                String[] schema_entry = data.split(" ", 2);

                // Check for schema meta-info
                if (schema_entry[0].equalsIgnoreCase("strictElementPaths"))
                    strictElementPaths = true;
                else
                    // Map the tag set (the rest of the line) to the VuFind
                    // field name (the first word)
                    schema_map.put(addTagSet(schema_entry[1]), schema_entry[0]);

            }
        } finally {
            if (inFile != null)
                inFile.close();
        }       
       
        // Open schema presets file, if there is one
        if (!schema_presets.equalsIgnoreCase("")) {

            // Open schema_map file
            inFile = null; // create a new stream to open a file
            try {
                inFile = new BufferedReader((Reader) new FileReader(
                        schema_presets));
                String data = " ";

                // Read in each line until termination
                while ((data = inFile.readLine()) != null) {
                    // Split into the field name and its preset value
                    String[] schema_entry = data.split(" ", 2);

                    presets_map.put(schema_entry[0], schema_entry[1]);

                }
            } finally {
                if (inFile != null)
                    inFile.close();
            }
        }
       
        // Cycle through all files; XmlFilter (below) makes sure each is
        // an XML file.
       
        File directory = new File( pathname );

        String[] eadFiles = directory.list( new XmlFilter() );

        // for (int i = 0; i < eadFiles.length; i++) { // Uncomment this and comment out the following line
        for (int i = 0; i < Math.min(4, eadFiles.length); i++) { // Used to limit number of records parsed for test purposes

            filename = eadFiles[i];
            eadRecords.add(parse(pathname + filename)); // Returns all the EAD data
                                                        // from ONE file to
                                                        // eadRecords
           
            System.out.println("Successfully parsed " + pathname + filename + "!");

            datum = new CRRA_Datum(); // This is REALLY, REALLY IMPORTANT! Bad things will happen if it is deleted!
        }
       
        System.out.println("Number of records = " + eadRecords.size());

        return eadRecords;
       
    }

    // Used to create schema for parsing
    private static LinkedHashSet<String> addTagSet(String string) {

        // Parse tags from tagSet and add individually as elements to a LHS<S>

        LinkedHashSet<String> returnSet = new LinkedHashSet<String>();
        String[] tagSet = string.split(" ");
       
        for(int i = 0; i < tagSet.length; i++){
            returnSet.add(tagSet[i]);
        }
       
        return returnSet;
    }

    // To get the number of records found
    public static int getRecordQuantity() {
        return recordQuantity;
    }

    // To get the LinkedHashSet<CRRA_Datum> eadRecords
    public static LinkedHashSet<CRRA_Datum> getEadRecords() {
        return eadRecords;
    }

    // To return whether the parse succeeded
    public static boolean getFailFlag() {
        return failFlag;
    }

    // Parsing a given file into a CRRA_Datum
    private static CRRA_Datum parse(String filename)
            throws IOException {

        DefaultHandler handler = new EADHandler(); // Parse the file using the
                                                    // handler and given schema
        parseXmlFile(filename, handler, false);

        // Add preset fields
        // Get the keySet of presets_map
        Set<String> preset_fields = presets_map.keySet();
        // Iterate over the keySet and put all values into the fields of the new datum
        Iterator<String> iter = preset_fields.iterator();
        while(iter.hasNext()){
            String field = iter.next();
            datum.setField(field, presets_map.get(field));
        }
               
        // Open file for sending to 'fullrecord' field
        BufferedReader inFile = null; // create a new stream to open a file
        try {
            inFile = new BufferedReader((Reader) new FileReader(filename));
            String data = " ";
            while ((data = inFile.readLine()) != null) {
                datum.concatenateField("fullrecord", data);

            }
        } finally {
            if (inFile != null)
                inFile.close();
        }
       
        return datum;
    }
   
    public static CRRA_Datum returnCurrentCD(){
        return datum;
    }

    /*
     * EADHandler looks for the appropriate parts of the EAD record to grab.
     *
     * A stack (tagStack) is used to keep track of the element "pathname".
     * Every time a new element is reached, its name is put on the stack;
     * when a close-element is encountered, its name is popped off, along
     * with any elements "on top" of it (like <p> tags and the like).
     *
     * If strictElementPaths is 'true', then data will be collected for VuFind
     * if and only if the elements on the tagStack match exactly at least one
     * set of elements in the schema.
     *
     * The currentFieldSet is the set of VuFind fields to send the current data
     * to. Every time an element is encountered, the CFS gets wiped out and
     * recalculated from scratch. Inelegant, but effective.
     *
     * If the CFS is not empty, then grabCharacters is true and data will be sent
     * to at least one field.
     *
     */
    static class EADHandler extends DefaultHandler {

        public void startElement(String uri, String localName, String qName,
                Attributes attributes) throws SAXException {

            // qName determines which field, or possible fields, the characters
            // go in.

            // Add the tag name to the stack
            tagStack.addFirst(qName);
           
            // Initialize the CFS
            currentFieldSet = new LinkedList<String>();
           
            // Update the CFS to include only fields corresponding to the tags currently on the stack.
            updateCFS();
           
            if(currentFieldSet.isEmpty())
                grabCharacters = false;
            else
                grabCharacters = true;
           


        }

        // Overridden to remove closed tags from the tagStack and update the CFS.
        public void endElement(String uri, String localName, String qName)
                throws SAXException {

            // Removes any non-closed tags on the front of the stack, plus the closed tag.
            while(tagStack.contains(qName)){
                tagStack.removeFirst();
            }
           
            updateCFS();
           
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
           
            try {
                datum.concatenateField("allfields", new String(ch, start, length));
            } catch (IOException e1) {
                e1.printStackTrace();
            }
           
            if (grabCharacters) {
                try {

                    // Update each field in 'currentFieldSet' in the current 'datum'
                    Iterator<String> cfsIter = currentFieldSet.iterator();
                    while(cfsIter.hasNext()){

                        fieldName = cfsIter.next();

                        datum.concatenateField(fieldName,
                                new String(ch, start, length) + "\n\t");
                    }

                    // Eliminates unnecessary whitespace (from the last field updated)
                    datum.setField(fieldName, datum.returnField(fieldName).trim());
                   
                } catch (IOException e) {
                    System.err
                            .println("*** Saving parsed data to CRRA Datum failed! ***");
                    e.printStackTrace();
                } finally {
                    grabCharacters = false;
                }
            }
        }

    }

    // Parses an XML file using a SAX parser.
    // If validating is true, the contents are validated against the DTD
    // specified in the file.
    public static void parseXmlFile(String filename, DefaultHandler handler,
            boolean validating) {
        try { // Create a builder factory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating); // Create the builder and parse
                                                // the file
            factory.newSAXParser().parse(new File(filename), handler);

        } catch (SAXException e) { // A parsing error occurred; the XML input
                                    // is not valid
            System.err.println("*** SAX Exception ***");
            e.printStackTrace();
            failFlag = true;
        } catch (ParserConfigurationException e) {
            System.err.println("*** Parser Configuration Exception ***");
            e.printStackTrace();
            failFlag = true;
        } catch (IOException e) {
            System.err.println("*** IO Exception in parseXmlFile ***");
            e.printStackTrace();
            failFlag = true;
        } // End try-catch
    }

    public static void updateCFS() {
       
        Set<LinkedHashSet<String>> keys = schema_map.keySet();
        Iterator<LinkedHashSet<String>> tagIter = keys.iterator();
        while(tagIter.hasNext()){
            LinkedHashSet<String> tempSet = tagIter.next();

            if(tagStack.containsAll(tempSet) && (!strictElementPaths ||
                    tempSet.containsAll(tagStack))){
                currentFieldSet.add(schema_map.get(tempSet));
            }
        }
        }
       
    }

}
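The tag-stack matching described in the EADHandler comment can be illustrated with a self-contained sketch. The tag and field names here are made up, and the matching logic is pulled out into a standalone method (the real class does this inline in updateCFS()), but the containsAll test is the same:

```java
import java.util.*;

public class TagMatchDemo {

    // Returns the VuFind fields whose tag sets match the current element
    // stack: non-strict matching only requires the stack to contain every
    // tag in the set; strict matching requires the two to hold exactly the
    // same tags.
    static List<String> matchFields(Map<Set<String>, String> schemaMap,
                                    Collection<String> tagStack, boolean strict) {
        List<String> fields = new LinkedList<>();
        for (Map.Entry<Set<String>, String> e : schemaMap.entrySet()) {
            Set<String> tags = e.getKey();
            if (tagStack.containsAll(tags) && (!strict || tags.containsAll(tagStack))) {
                fields.add(e.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical crosswalk: a set of EAD tags -> a VuFind field
        Map<Set<String>, String> schemaMap = new LinkedHashMap<>();
        schemaMap.put(new LinkedHashSet<>(Arrays.asList("ead", "archdesc", "did", "unittitle")), "title");
        schemaMap.put(new LinkedHashSet<>(Arrays.asList("ead", "archdesc", "did", "unitdate")), "date");

        // Stack while parsing <ead><archdesc><did><unittitle>
        List<String> stack = Arrays.asList("unittitle", "did", "archdesc", "ead");
        System.out.println(matchFields(schemaMap, stack, false)); // prints "[title]"

        // An extra wrapper tag (<p>) on the stack still matches non-strictly,
        // but strict matching rejects it:
        List<String> deeper = Arrays.asList("unittitle", "p", "did", "archdesc", "ead");
        System.out.println(matchFields(schemaMap, deeper, true)); // prints "[]"
    }
}
```

This also shows why the non-strict mode is forgiving of intervening markup like &lt;p&gt; tags, while strictElementPaths pins data collection to an exact element path.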

Project 2 - Sending Data to the Index

The data thus collected had to be sent to our Solr installation.

/**
 *
 */
package crrasolrindexer;

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.*;

import org.apache.solr.client.solrj.*;
import org.apache.solr.client.solrj.impl.*;
import org.apache.solr.common.*;

/**
 * @author slittle2
 *
 *    Indexer simply turns an IndexDatum into a URL and sends it to the
 *    Solr server to be indexed. :-) Sounds simple, right?
 *
 *    Scratch that. Indexer *actually* uses Solrj to communicate with Solr.
 *    No URLs to send, etc.
 *
 */
public class Indexer {

    /**
     * @param args
     * @throws IOException
     * @throws SolrServerException
     */
   
    // Available for testing
    public static void main(String[] args) throws SolrServerException, IOException, MalformedURLException {

        boolean addNewData = false;
      
        // This is being tested on a multicore configuration
        String url = "http://localhost:8983/solr/core0/";
        SolrServer server = new CommonsHttpSolrServer(url);
      

        // Use the following code if you want to start with a clean sweep,
        // eliminating all contents of the index before proceeding.
        /* try {

            server.deleteByQuery("*:*");// Clean test up -- delete everything!
        } catch (SolrException e) {
            System.err.println("*** Delete failed ***");
            System.err.println("(probably passed the wrong URL)");
        }*/

        if (addNewData) {

            try {

                // Creates a faux index record to test sending data to the Indexer.
                // Note that this uses the IndexDatum, not the CRRA_Datum.
                SolrInputDocument doc1 = new SolrInputDocument();
                doc1.addField("key", "12345");
                doc1.addField("author", "Billy-bob Shakespeare");
                doc1.addField("title", "Henry 500");
                doc1.addField("date", "1950-?");
                doc1.addField("note", "Not a real entry.");
                doc1.addField("subject", "Hick classics.");
                doc1.addField("text", "Some text here.");
                doc1.addField("type", "MARC");

                Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                docs.add(doc1);

                server.add(docs);

            } catch (SolrException e) {
                System.err.println("*** Add data failed ***");
                System.err
                        .println("(probably mismatch with field names in schema.xml)");
            }

        }

        server.commit(); // Always a good idea


    }
   
    // Uses the Index Datum class. It is recommended that you use indexCD (below)
    // instead.
    public static void indexID(LinkedHashSet<IndexDatum> setOfID, String urlSolr) throws SolrServerException, IOException{
      
        SolrServer server = new CommonsHttpSolrServer(urlSolr);
        Iterator<IndexDatum> iter = setOfID.iterator();
        IndexDatum singleRecord = null;

        SolrInputDocument doc1 = null;
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

        System.out.println("Number records in Indexer: " + setOfID.size());
      
        while (iter.hasNext()) { // Add each IndexDatum to the Index

            singleRecord = iter.next();

            try {

                doc1 = new SolrInputDocument();
                doc1.addField("key", singleRecord.returnField("key"));
                doc1.addField("author", singleRecord.returnField("author"));
                doc1.addField("title", singleRecord.returnField("title"));
                doc1.addField("date", singleRecord.returnField("date"));
                doc1.addField("note", singleRecord.returnField("note"));
                doc1.addField("subject", singleRecord.returnField("subject"));
                doc1.addField("text", singleRecord.returnField("text"));
                doc1.addField("type", singleRecord.returnField("type"));

                docs.add(doc1);

            } catch (SolrException e) {
                System.err.println("*** Add data failed ***");
                System.err
                        .println("(probably mismatch with field names in schema.xml)");
            }

        }

        server.add(docs);

        server.commit();

    } // end indexID method
   
   
    // Uses the CRRA_Datum class. This is recommended.
    public static void indexCD(LinkedHashSet<CRRA_Datum> setOfID, String urlSolr) throws SolrServerException, IOException{
      
        SolrServer server = new CommonsHttpSolrServer(urlSolr);
        Iterator<CRRA_Datum> iter = setOfID.iterator();
        CRRA_Datum singleRecord = null;

        SolrInputDocument doc1 = null;
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

        String schema_names = CRRA_EADRetriever.returnCurrentCD().returnSchemaNames();
        String[] schema_array = schema_names.split(" ");
      
        System.out.println("Number records in Indexer: " + setOfID.size());
      
      
        while (iter.hasNext()) { // Add each CRRA_Datum to the Index

            singleRecord = iter.next();

            try {

                doc1 = new SolrInputDocument();
              
                // Iterate through schema field names, passing them to singleRecord.
                // Then add the associated results to doc1.
                for (int i = 0; i < schema_array.length; i++){
                    String fieldName = schema_array[i];
                    doc1.addField(fieldName, singleRecord.returnField(fieldName));
                }
              
                docs.add(doc1);

            } catch (SolrException e) {
                System.err.println("*** Add data failed ***");
                System.err
                        .println("(probably mismatch with field names in schema.xml)");
            }

        }

        server.add(docs);

        server.commit();

    } // end indexCD method
   
    // Deletes a record or records as indicated by the
    // Solr-format query string. Example: "*:*" deletes everything.
    public static void deleteRecord(String query, String url){

        try {
            SolrServer server = new CommonsHttpSolrServer(url);
            server.deleteByQuery(query);
            server.commit(); // Always a good idea
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}