Single Document Tutorial Example

This tutorial is not yet complete, and some parts of it are inaccurate. Please use the User Manual until the tutorial is completed.

Because the Infomap software cannot construct documents from document fragments stored in multiple files (see the tutorial background information), a single-document corpus must always consist of a single file. The Infomap software needs a good deal of data to produce meaningful results, so the single document in our corpus will be a rather large one: the King James Bible. The King James Bible is freely available in electronic form from Project Gutenberg. (You can also download a copy from here.)

Please download this file to your computer now. On my computer, this file is stored as /home/cederber/corpora/kjbible/kjbible.txt, and that is how I will refer to it in the examples that follow. Don't forget to substitute the name of the file on your system when you follow along with the examples.

The first step in building a model from this corpus is to add appropriate annotations to the kjbible.txt file so that the Infomap software can process it. In this example, where the entire corpus is to be treated as a single document, we need to add a total of four XML-like tags, as follows:

  1. A single <DOC> tag, at the very beginning of the file. This tag marks the beginning of a document.
  2. A single <TEXT> tag, placed after the header information at the beginning of the file (which describes Project Gutenberg and so forth) and before the text of the Bible itself. Insert this tag between the line reading
    The First Book of Moses:  Called Genesis
    and the line reading
    1:1 In the beginning God created the heaven and the earth.
    The <TEXT> tag marks the beginning of the part of each document that is actually considered by the Infomap NLP software as it builds a model. The text between the <DOC> tag and the <TEXT> tag can provide meta-information about the document, such as the document's title, but such information is not interpreted by the Infomap software. This meta-information text may contain other XML-like tags (such as <TITLE>).
  3. A single </TEXT> tag, coming after the text of the document and before the </DOC> tag (below). Any information between the </TEXT> and </DOC> tags is ignored when building a model. In this example, add the </TEXT> tag to the very end of the file.
  4. A single </DOC> tag. Add this tag to the very end of the file, after the </TEXT> tag that you added in step 3.
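
The four annotation steps above can also be scripted rather than done by hand. The following is a minimal Python sketch, not part of the Infomap distribution: the function names, the output filename, and the latin-1 encoding are all assumptions made for illustration. It places <TEXT> immediately before the first verse, which lies between the two lines quoted in step 2.

```python
from pathlib import Path

# Landmark lines quoted in the tutorial (whitespace is normalized before comparing).
GENESIS_TITLE = "The First Book of Moses: Called Genesis"
FIRST_VERSE_PREFIX = "1:1 In the beginning"


def _normalize(line):
    """Collapse runs of whitespace so spacing differences do not matter."""
    return " ".join(line.split())


def annotate_single_document(lines):
    """Return a new list of lines with <DOC>, <TEXT>, </TEXT>, </DOC> added.

    `lines` is a list of strings without trailing newlines.
    """
    # Find the first verse; <TEXT> goes directly before it.
    verse_index = next(
        (i for i, line in enumerate(lines)
         if _normalize(line).startswith(FIRST_VERSE_PREFIX)),
        None,
    )
    if verse_index is None:
        raise ValueError("could not find the first verse (1:1 ...)")
    # Sanity check: the Genesis title line should appear before the verse.
    if not any(_normalize(line) == GENESIS_TITLE for line in lines[:verse_index]):
        raise ValueError("could not find the Genesis title line before the first verse")
    return [
        "<DOC>",                 # step 1: very beginning of the file
        *lines[:verse_index],    # header, kept as uninterpreted meta-information
        "<TEXT>",                # step 2: just before the text of the Bible
        *lines[verse_index:],
        "</TEXT>",               # step 3: end of the file
        "</DOC>",                # step 4: after </TEXT>
    ]


def annotate_file(src, dst):
    """Read `src`, annotate it, and write the result to `dst`.

    Assumption: latin-1 is used because it can decode any byte sequence;
    adjust the encoding if your copy of the file is stored differently.
    """
    lines = Path(src).read_text(encoding="latin-1").splitlines()
    annotated = annotate_single_document(lines)
    Path(dst).write_text("\n".join(annotated) + "\n", encoding="latin-1")
```

For example, annotate_file("kjbible.txt", "kjbible.annotated.txt") would write an annotated copy (the second filename is hypothetical); writing to a separate file keeps the original download intact in case the landmark lines are not found.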

With these tags in place, the corpus is in the proper format and is ready to be processed by the Infomap software. Two steps remain: compiling the software and using it to process the corpus.
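
Before moving on, it can be worth checking that the tags really are in place. The sketch below is a hypothetical helper, not an Infomap tool; it assumes the tags are written in upper case exactly as shown in this tutorial and that each of the four appears exactly once.

```python
import re

# The four tags a single-document corpus needs, in the order they must appear.
EXPECTED_ORDER = ["<DOC>", "<TEXT>", "</TEXT>", "</DOC>"]
TAG_PATTERN = re.compile(r"</?(?:DOC|TEXT)>")


def check_single_document_markup(text):
    """Return a list of problems; an empty list means the markup looks right."""
    problems = []
    found = TAG_PATTERN.findall(text)
    if found != EXPECTED_ORDER:
        problems.append(
            f"expected tags {EXPECTED_ORDER} in that order, found {found}"
        )
    if not text.lstrip().startswith("<DOC>"):
        problems.append("<DOC> is not at the very beginning of the file")
    if not text.rstrip().endswith("</DOC>"):
        problems.append("</DOC> is not at the very end of the file")
    return problems
```

Calling check_single_document_markup on the contents of the annotated file should return an empty list; anything else points to a missing, duplicated, or misplaced tag.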

To compile the software, simply type make in the top-level directory of the Infomap software distribution. With a little luck, the compilation will run without errors. If it fails, most likely you can get it to work with a few minor changes. Please see this guide for instructions. You can also contact us for help.

Once the software is compiled, you should make some changes to the Makefile to indicate the location of the corpus. Change the values of the Makefile variables as follows:

CORPUS_NAME = kjbible
CORPUS_LONG_NAME = King James Bible
CORPUS_DESCRIP = The King James Bible, from the Gutenberg Project

CORPUS_FILE = /home/cederber/corpora/kjbible/kjbible.txt
CORPUS_DIR = /home/cederber/corpora/kjbible/

These changes tell the software where the corpus is, and give the corpus various descriptive names to be used for different purposes. (Remember that the CORPUS_FILE and CORPUS_DIR variables should have values indicating where you saved the King James Bible file on your system; the values above are examples taken from my system.)

When the software runs, it will produce a model, which will consist of a number of files stored in a single directory. This model will be created in a working directory; it can later be installed to a more permanent location. You should create two empty directories for these purposes, and change the Makefile variables WORKING_DATA_DIR and INSTALLED_DATA_DIR to the names of these directories.

All of the other Makefile variables can be left as they are. You are now ready to run the software; type make data to do so.

This process will take a little while; if it fails, first look for obvious causes, such as a misspelled directory name. If you get stuck, please contact us and describe the problem.
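
One way to look for the obvious causes mentioned above is to check that every path named in the Makefile actually exists. The following Python sketch is a hypothetical helper, not part of Infomap. It assumes the variables are set with simple NAME = value lines, as in the examples in this tutorial, and it does not expand references such as $(OTHER_VAR).

```python
import os
import re

# Matches simple assignments such as "CORPUS_NAME = kjbible" (also := and ?=).
ASSIGNMENT = re.compile(r"^\s*([A-Za-z0-9_]+)\s*[:?]?=\s*(.*?)\s*$")


def read_makefile_variables(makefile_text):
    """Collect NAME = value assignments from Makefile text into a dict."""
    variables = {}
    for line in makefile_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            variables[match.group(1)] = match.group(2)
    return variables


def find_path_problems(variables):
    """Return a list of messages for unset or nonexistent corpus/data paths."""
    problems = []
    corpus_file = variables.get("CORPUS_FILE")
    if not corpus_file:
        problems.append("CORPUS_FILE is not set")
    elif not os.path.isfile(corpus_file):
        problems.append(f"CORPUS_FILE does not exist: {corpus_file}")
    for name in ("CORPUS_DIR", "WORKING_DATA_DIR", "INSTALLED_DATA_DIR"):
        value = variables.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} is not an existing directory: {value}")
    return problems
```

To use it, read the Makefile into a string, pass it to read_makefile_variables, and print whatever find_path_problems returns for the resulting dictionary. An empty list means the four paths exist; it says nothing about other possible causes of failure.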

Once make data has completed, type make installdata. When this is done, the model is installed and ready for use.

Using the Model

Now you can see the results! Run the associate program in the search/ subdirectory of the top-level infomap/ directory. You should see something like the following:

No query terms specified.

Usage:  `associate [-w | -d | -q] 
                ( [-t toc_file] [-c model(corpus)_tag] | [-m model_dir] )
                [-n num_neighbors] [-f vector_output_file]
                 [pos_term_2 ... pos_term_n]
                [NOT neg_term_1 ... neg_term_n]'

        Task:   -w      associate words (DEFAULT)
                -d      associate documents
                -q      print query vector

        Models (default.toc):
                kjbible (kjbible) (DEFAULT)

Note that the bottom of this output indicates that the kjbible model has been installed.

Try the command associate moses. This will return the words in the King James Bible that are conceptually "most similar" to moses in this model. You should see something like this:

Play around with some other queries to get a sense for what has been learned. Keep in mind that by NLP standards, the Bible is quite a small corpus, so results may be of inconsistent quality.
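
If you want to try many queries, you can drive associate from a script. The sketch below only assembles the command line, following the usage message printed above (a task flag, an optional -n, positive terms, then NOT and any negative terms). The function name and the ./associate program path are assumptions for illustration, and the sketch makes no claims about what the program prints.

```python
def build_associate_command(positive_terms, negative_terms=(),
                            num_neighbors=None, task="-w",
                            program="./associate"):
    """Build an argument list for the associate program.

    task is one of -w (associate words), -d (associate documents),
    or -q (print query vector), as listed in the usage message.
    """
    if task not in ("-w", "-d", "-q"):
        raise ValueError(f"unknown task flag: {task}")
    if not positive_terms:
        raise ValueError("at least one query term is required")
    command = [program, task]
    if num_neighbors is not None:
        command += ["-n", str(num_neighbors)]
    command += list(positive_terms)
    if negative_terms:
        # Negative terms follow the literal keyword NOT.
        command += ["NOT", *negative_terms]
    return command


# Example (run from the search/ subdirectory; the path is an assumption):
#   import subprocess
#   subprocess.run(build_associate_command(["moses"], num_neighbors=10), check=True)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when a query term contains unusual characters.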

To see how to annotate a corpus to contain multiple documents, and what can be done with such a corpus, continue to the multiple-document, single-file example.