Once I extracted the data from the MARC records, the next step was to insert it into a URL that could query the Internet Archive and retrieve the resulting XML file.
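Before diving into the class itself, here's a sketch of what building such a query URL might look like. This is not the actual MarcRetriever code; the endpoint and parameter names are my reconstruction of the Archive's advancedsearch interface, so treat them as assumptions.

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryURLSketch {
    // Hypothetical helper: compose a search URL from an author/title pair.
    static String buildQuery(String author, String title)
            throws UnsupportedEncodingException {
        String q = "creator:(" + author + ") AND title:(" + title + ")";
        return "http://www.archive.org/advancedsearch.php"
                + "?q=" + URLEncoder.encode(q, "UTF-8")
                + "&fl%5B%5D=identifier" // fl[]=identifier, URL-encoded
                + "&rows=10&output=xml";
    }
}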
package iasearcher;
/**
* @author slittle2
*
* This version of IASearcher takes a URL from the file indicated by
* the variable "searchfile" and sends it to the Internet Archive
* to search for the given data. It receives an XML file in return.
* (In test runs, the URL used is hardcoded.)
*
* Next it parses the XML file and finds the "numFound" value:
* = 0 : Record title/author/URL in "No luck" log file
* = 1 : Record title/author/key in "1 Hit" log file
* > 1 : Record title/author/keys in "Multi-hit" log file
* Within each of these files, the values are separated by tabs, with
* newlines separating records.
*
* Finally, it converts each of the "1 Hit" keys into a URL and saves it into
* another file. This file can then be fed to 'wget' or a similar program
* to retrieve data folders associated with each key from the IA.
*
* Later versions will:
* - determine which of the Multi-hit records returned is best, and save that data to a uniquely named file
*
* Or the later versions might not, since this project probably won't go forward, at least not in its current incarnation.
*
* Note that there's no checking to see whether the hits returned are actually
* relevant or whether they are false positives. In a more robust version,
* we'd want to take the time to do that.
*/
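The parsing itself happens in IASearcherXMLReader, a separate class that isn't shown in this post. Just to illustrate the idea, here's a minimal sketch of pulling the numFound value out of the response, assuming it arrives in the Solr-style header the Archive was returning at the time, e.g. <result name="response" numFound="3" start="0">:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumFoundSketch {
    // Return the numFound attribute's value, or 0 if it isn't present.
    static int numFound(String xml) {
        Matcher m = Pattern.compile("numFound=\"(\\d+)\"").matcher(xml);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }
}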
import java.io.*;
import java.net.*;
public class IASearcher {
/**
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
Most of the String variables here are pathnames to various files in my workspace. Note that most of them won't be used unless the commented-out call to retrieveKeys below is un-commented.
String searchfile = "C:/Documents and Settings/slittle2/workspace/MarcRetriever/searchkeys.txt";
String outFileName = "C:/Documents and Settings/slittle2/workspace/MarcRetriever/output.xml";
String noResultsLog = "C:/Documents and Settings/slittle2/workspace/MarcRetriever/noResults.txt";
String oneResultLog = "C:/Documents and Settings/slittle2/workspace/MarcRetriever/Success Files 5-26/oneResult.txt";
String manyResultsLog = "C:/Documents and Settings/slittle2/workspace/MarcRetriever/manyResults.txt";
String failures = "C:/Documents and Settings/slittle2/workspace/MarcRetriever/failures.txt";
String keyURLfile = "Z:/IASearch/keyURLs.txt";
Two parts here: read the author/title pairs that I just saved to a file back out of that file (a bit of a waste, in hindsight), then create the URLs to be fed one at a time to the Archive.
// This code is currently commented out so as to focus testing on the method following it
// retrieveKeys(searchfile, outFileName, noResultsLog, oneResultLog, manyResultsLog, failures);
createKeyURLS(oneResultLog, keyURLfile, true);
}
A couple of different interactions take place with the Archive here. retrieveKeys sends author/title info and retrieves a 'key'--an alphanumeric code that refers to a unique text on the Archive. createKeyURLS then creates a new URL that can be used by wget (or a similar program) to mirror the texts themselves.
Note that the methods are given here in the reverse order from the way in which they are called.
// Takes the keys from the "1 Hit" results, inserts them into URLs for the IA, and outputs them to a file.
private static void createKeyURLS(String oneResultLog, String keyURLfile, boolean redirect)
throws IOException {
BufferedReader inFile = null; // create a new stream to open a file
FileWriter outFile = null; // creates a new stream to output to a file
final String addressRoot = "http://www.archive.org/download/";
// Set the 'in' file to the search file
try {
inFile = new BufferedReader(new FileReader(oneResultLog));
outFile = new FileWriter(keyURLfile, false);
String data = " "; // holds each line read from the input file
String key = ""; // the "key" is a code returned by the initial query to the Archive
String newURL = "";
// Retrieve URLs from file & send to Internet Archive
while ((data = inFile.readLine()) != null) {
// Extract keys
String[] splitData = data.split("\t");
// String author = splitData[0];
// System.out.println(author);
// String title = splitData[1];
key = splitData[2];
newURL = addressRoot + key + "/";
// Code to avoid URL redirects (new URLs were causing wget to hiccup)
if(redirect) newURL = redirectAndTrim(newURL);
Querying the Archive with the key-URL makes the Archive redirect to another URL. When wget tried to follow that redirect itself, it crashed on account of some whitespace control characters at the end of the forwarded address. So I programmed this class to send the key-URLs to the Archive, retrieve the redirected addresses on wget's behalf, trim off the whitespace, and save a perfectly usable URL. :-)
// Create URLs, & output URLs to file
if (newURL != null) outFile.append(newURL + " \n");
}
} catch (MalformedURLException e) {
System.err.println("*** Malformed URL Exception ***");
} catch (FileNotFoundException e) {
System.err.println("*** File not found! ***");
e.printStackTrace();
} catch (IOException e) {
System.err.println("*** IO Exception ***");
e.printStackTrace();
} finally {
if (inFile != null)
inFile.close();
if (outFile != null)
outFile.close();
}
Again, it's always a good idea to close your streams, excepting the System standard input/output.
}
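(Today I'd let Java 7's try-with-resources do that closing for me. A minimal sketch, assuming Java 7+ and the same variables as above, of how the heart of createKeyURLS could look:

try (BufferedReader in = new BufferedReader(new FileReader(oneResultLog));
        FileWriter out = new FileWriter(keyURLfile, false)) {
    final String addressRoot = "http://www.archive.org/download/";
    String line;
    while ((line = in.readLine()) != null) {
        String key = line.split("\t")[2]; // author <tab> title <tab> key
        out.append(addressRoot + key + "/ \n");
    }
} // both streams close automatically here, even if an exception was thrown

No finally block needed.)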
In the end, the "AndTrim" part turned out to be unnecessary: the redirected URLs were stripped of the offending characters before my class ever saw them, so no trimming was needed.
protected static String redirectAndTrim(String key) throws IOException {
// Retrieve the redirected URL from IA
URI uri = null;
URL url = null;
HttpURLConnection h = null; // declared out here so 'finally' can disconnect it
try {
// Open connection to IA
uri = new URI(key);
url = uri.toURL();
h = (HttpURLConnection) url.openConnection();
h.setInstanceFollowRedirects(false); // per-connection; we want the Location header, not the followed page
String location = h.getHeaderField("Location");
System.out.println(location);
return location;
// Catching errors
} catch (URISyntaxException e) {
System.err.println("*** URI Syntax Exception ***");
e.printStackTrace();
} catch (MalformedURLException e) {
System.err.println("*** Malformed URL Exception ***");
e.printStackTrace();
} catch (FileNotFoundException e) {
System.err.println("*** File not found! ***");
e.printStackTrace();
} catch (IOException e) {
System.err.println("*** IO Exception ***");
e.printStackTrace();
} finally {
if (h != null)
h.disconnect();
}
return null;
}
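A slightly more defensive variant would check the response code before trusting the Location header, and fall back to the original address when there's no redirect at all. A sketch (resolveRedirect is just an illustrative name, not something in the project):

static String resolveRedirect(String address) throws IOException {
    HttpURLConnection h =
            (HttpURLConnection) new URL(address).openConnection();
    h.setInstanceFollowRedirects(false); // per-connection, unlike the static setter
    int status = h.getResponseCode();    // fires the actual request
    if (status == HttpURLConnection.HTTP_MOVED_PERM
            || status == HttpURLConnection.HTTP_MOVED_TEMP) {
        return h.getHeaderField("Location").trim();
    }
    return address; // no redirect; the original URL is already usable
}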
Here's where the initial call to the Archive takes place: reading in an author and title from the specified file, forming a query URL, querying the Archive, retrieving an XML file of results, and saving the resulting 'key' (if any) to one of three files. (The 'no-hit' file, instead of keys, uses the original URL that was sent.)
// Retrieves keys from previously constructed key file, and specifies the names of the log files and the single XML temp file.
public static void retrieveKeys(String searchfile, String outFileName, String noResultsLog, String oneResultLog, String manyResultsLog, String failures) throws IOException {
BufferedReader inFile = null; // create a new stream to open a file
int bufferChar = 0;
URI uri = null;
URL url = null;
InputStream inURI = null;
FileWriter outFile = null; // creates a new stream to output to a file
int none = 0, one = 0, many = 0, total = 0;
// Set the 'in' file to the search file
try {
inFile = new BufferedReader(new FileReader(searchfile));
String data = " ";
// Retrieve URLs from file & send to Internet Archive
while ((data = inFile.readLine()) != null) {
String resultsIdentifiers = "";
String received = "";
int resultsQuantity = 0; // Number of results returned for a given query
// Retrieves author, title, and URL for IA
String[] splitData = data.split("\t");
String author = splitData[0];
String title = splitData[1];
data = splitData[2];
// Open connection to IA
uri = new URI(data);
url = uri.toURL();
inURI = url.openStream();
// Read the whole response. Obviously, there are more efficient ways to do
// this than reading an entire XML file one character at a time; today I'd
// take advantage of a BufferedInputStream or something similar. Live and learn.
// Appending only while read() != -1 keeps the EOF sentinel out of the string.
while ((bufferChar = inURI.read()) != -1) {
received += (char) bufferChar;
}
if (received.length() > 2) {
received = prune(received);
// TODO change this and IASearcherXMLReader.java so that we
// can pass the data to the latter, not save it to a file
// repeatedly
outFile = new FileWriter(outFileName, false);
outFile.append(received);
if (outFile != null)
outFile.close();
// Parse XML and find "numFound" tag; read "numFound" value
// and log in appropriate file
// Always call this first
// before trying to retrieve data
IASearcherXMLReader.parse();
resultsQuantity = IASearcherXMLReader.getRecordQuantity();
resultsIdentifiers = IASearcherXMLReader.getIdentifiers();
} else {
resultsQuantity = 0;
}
// Depending on the number of hits returned, save the relevant info to the appropriate results file.
switch (resultsQuantity) {
case 0:
outFile = new FileWriter(noResultsLog, true);
// Write URL and newline
outFile.append(author + "\t" + title + "\t" + url + "\n");
++none;
break;
case 1:
outFile = new FileWriter(oneResultLog, true);
// Write single file key and newline
outFile.append(author + "\t" + title + "\t" + resultsIdentifiers + "\n");
++one;
break;
default:
outFile = new FileWriter(manyResultsLog, true);
// Write all keys, tab-delimited, with newlines, to file
outFile.append(author + "\t" + title + "\t" + resultsIdentifiers + "\n");
++many;
}
if (outFile != null) outFile.close();
In case of an error on the part of the parser, I created a fourth log file: failures. It simply records which record triggered the error, without attempting any further diagnosis.
if(IASearcherXMLReader.getFailFlag()){ // Did the parse attempt throw an exception?
outFile = new FileWriter(failures, true);
outFile.append(data + "\n");
if (outFile != null) outFile.close();
}
total = none + one + many;
} // end while on EOF
// Catching errors
} catch (URISyntaxException e) {
System.err.println("*** URI Syntax Exception ***");
} catch (MalformedURLException e) {
System.err.println("*** Malformed URL Exception ***");
} catch (FileNotFoundException e) {
System.err.println("*** File not found! ***");
e.printStackTrace();
} catch (IOException e) {
System.err.println("*** IO Exception ***");
e.printStackTrace();
}finally {
if(inURI!=null) inURI.close();
if(inFile!=null) inFile.close();
if(outFile!=null) outFile.close();
System.out.println("None: " + none + "\t One: " + one + "\t Many: " + many + "\t Total: " + total);
}
}
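As for the character-at-a-time read loop above: a minimal sketch of the buffered approach I'd reach for today, using the same inURI stream (the variable names are illustrative):

StringBuilder sb = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(inURI));
char[] buf = new char[4096];
int n;
while ((n = reader.read(buf)) != -1) {
    sb.append(buf, 0, n); // append only the characters actually read
}
String received = sb.toString();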
There always seemed to be a couple odd characters before and after the XML files, so this method trims them.
// Prunes excess characters from the beginning and end of the XML file
private static String prune(String data) {
char scanChar = data.charAt(0);
int head = 0;
int tail = data.length() - 1;
while(scanChar != '<') {
head++;
scanChar = data.charAt(head);
}
scanChar = data.charAt(tail);
while(scanChar != '>') {
tail--;
scanChar = data.charAt(tail);
}
return data.substring(head, tail + 1); // include the final '>'
}
}
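In hindsight, indexOf and lastIndexOf would do the same pruning in one line, still assuming the data actually contains a '<' and a '>':

private static String prune(String data) {
    // Keep everything from the first '<' through the last '>'.
    return data.substring(data.indexOf('<'), data.lastIndexOf('>') + 1);
}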
Voilà!