Infomap User Manual

Introduction
Introduction to Examples
Building a Model
Searching Models

Introduction

The Infomap software is based on the concept of a model; each model consists of the files in a directory known as the "model directory" or "model data directory" for that model. The Infomap software performs two basic functions: building models by learning them from a free-text corpus using learning parameters specified by the user, and searching an existing model to find the words or documents that best match a query according to that model. After a model has been built, it can also be installed to make searching it more convenient, both for you and for other users.

This manual describes how to use the Infomap software to build models and to search models once they have been built. It also discusses installing models.

Model-building and search are performed using an algorithm similar to Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI). For details on how the Infomap algorithm works, please see the Infomap Algorithm Description.

Introduction to Examples

To make the discussion of building and searching models more concrete, this manual uses two running examples. Each example involves a fictitious corpus that we imagine to be in one of the two input formats recognized by the model-building software. If you would like to follow a worked example using corpora that are included with the software distribution, see the Infomap tutorial.

The sf corpus is entirely in a single file, /usr/local/share/corpora/sf/sf.txt. The many corpus consists of many files, all of which are in the directory /usr/local/share/corpora/many/. The file /usr/local/share/corpora/manyNames.txt contains the filenames of all of the files making up the many corpus. Details on the formats of these two imaginary corpora (which are the two formats that the Infomap model-building code can currently parse) can be found in this document.
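
A list file such as manyNames.txt could be generated with find. The following is only a sketch: it assumes that the list holds one path per line, which should be checked against the corpus format description, and make_name_list is a hypothetical helper name, not part of the Infomap software.

```shell
# Hypothetical helper: write the path of every regular file under a corpus
# directory to a list file, one path per line, sorted for reproducibility.
# Assumption: the names file holds one path per line.
make_name_list() {
    corpus_dir=$1
    list_file=$2
    find "$corpus_dir" -type f | sort > "$list_file"
}

# Example with the paths of the running example (writing there may need root):
# make_name_list /usr/local/share/corpora/many /usr/local/share/corpora/manyNames.txt
```

The list file should live outside the corpus directory, as it does in the running example; otherwise find would list the list file itself.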

For the sake of later examples, we will imagine that the sf corpus consists of articles published recently in a mainstream U.S. newspaper, and that the many corpus consists of medical journal papers.

Building a Model

Building a model with the Infomap software consists of three steps:

Building (compiling) and installing the Infomap software.
Obtaining a corpus in the appropriate format.
Running the software to build the model.

Building (Compiling) and Installing the Infomap Software

The Infomap software uses the GNU build system, in particular Autoconf and Automake. Therefore, from the top-level directory created when you unpack the distribution tarball, you should run

  $ ./configure
  $ make
  $ make install

The first step (./configure) checks your system for the features the software needs. In particular, it checks for a DBM-compatible database library. If any of these checks fails, you will need to correct the problem before installation can proceed. Examine the output of configure carefully to determine what the problem is. You may need to install a DBM-compatible library such as GNU DBM (also available as RPM and .deb packages for Linux).

If you have difficulty correcting the problem, please send mail to infomap-nlp-users@lists.sourceforge.net.

The second step (running make) compiles the software. If configure succeeded, then make should run without problems. If make has trouble after a successful configure, please report the problem to infomap-nlp-users@lists.sourceforge.net.

The third step (make install) installs the compiled programs and associated data files. By default, the installed files are placed in various subdirectories of the /usr/local directory. For instance, executable programs are installed in /usr/local/bin, and manual pages are installed in /usr/local/man. Shared data files used by the Infomap NLP software are installed in /usr/local/share/infomap-nlp/.

The advantage of installing to these directories is that the programs and manual pages should automatically become available to all users on your system. The disadvantage is that root access is required to write to /usr/local and its subdirectories. If you need to install to another location, you can use the --prefix option to the configure command. For instance, to install to subdirectories of /home/jrandom/install (i.e. programs go to /home/jrandom/install/bin, man pages go to /home/jrandom/install/man, and so forth), you would use the following sequence of commands:

  $ ./configure --prefix=/home/jrandom/install
  $ make
  $ make install

In this case, you would want to add /home/jrandom/install/bin to your PATH environment variable, and /home/jrandom/install/man to your MANPATH.
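
In sh or bash, the two variables could be set as follows. This is a minimal sketch that assumes the /home/jrandom/install prefix from the example; the variable name prefix is used only for illustration.

```shell
# Assumed install prefix, taken from the example above.
prefix=/home/jrandom/install

# Prepend the program and manual directories so that they are searched first.
# If MANPATH was empty, the result ends in a colon; many man implementations
# treat a trailing colon as "also search the default manual directories".
PATH=$prefix/bin:$PATH
MANPATH=$prefix/man:${MANPATH:-}
export PATH MANPATH
```

To make the change permanent, lines like these would normally go in a shell startup file such as ~/.profile.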

Corpus Formats

This user manual and its examples assume that you have a corpus in a format that can be parsed by the Infomap software. It may be necessary for you to convert a corpus's format before using the software on that corpus. The Infomap software accepts corpora in two simple formats, which are described here.

Building a Model With the Infomap Software

Infomap models are built using the infomap-build program, which automatically invokes other Infomap NLP programs. The infomap-build(1) manual page (run man infomap-build to see it) gives details of how to run this program, but imitating the examples below is probably an easier way to use it for the first time.

Building a model from a single-file corpus

Consider building a model from our single-file sf example corpus. Recall that this corpus consists of the single file /usr/local/share/corpora/sf/sf.txt. Let's say that we want this model to be called sf after the name of the corpus, and that we would like the directory containing the sf model files (known as the model data directory) to be created in /home/jrandom/infomap_models.

First, we set the environment variable INFOMAP_WORKING_DIR to /home/jrandom/infomap_models. This environment variable is used by the infomap-build program to determine where it should create model data directories when it builds models. It is called the working directory because it is the place where models are stored as they are being built. (After building, we can use the infomap-install program to install the files from the working directory to a different location.) To set the environment variable in sh or bash, run

  $ INFOMAP_WORKING_DIR=/home/jrandom/infomap_models
  $ export INFOMAP_WORKING_DIR

In csh or tcsh, use

  % setenv INFOMAP_WORKING_DIR /home/jrandom/infomap_models

Once INFOMAP_WORKING_DIR is set, simply run

  $ infomap-build -s /usr/local/share/corpora/sf/sf.txt sf

This command tells infomap-build to build a model called sf from the corpus contained in the single file /usr/local/share/corpora/sf/sf.txt.

When this command is done, the directory /home/jrandom/infomap_models/sf should exist and should contain a number of files. At this point you can search the model using the associate program, or install the model.

Building a model from a multiple-file corpus

To build a model from the many example corpus, we set INFOMAP_WORKING_DIR in the same way. Then, assuming we would like to call this model many_01, we run

  $ infomap-build -m /usr/local/share/corpora/manyNames.txt many_01

This command tells infomap-build to build a model called many_01 from the corpus made up of the files listed in /usr/local/share/corpora/manyNames.txt.

When this command is done, the directory /home/jrandom/infomap_models/many_01 should exist and should contain a number of files. At this point you can search the model using the associate program, or install the model.
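
A quick way to confirm that a build produced a model data directory is to check that the directory exists and is not empty. The sketch below does only that; model_dir_ready is a hypothetical helper, and it makes no assumption about which files infomap-build writes.

```shell
# Hypothetical helper: succeed if the model data directory for the given
# model tag exists under INFOMAP_WORKING_DIR and contains at least one entry.
model_dir_ready() {
    if [ -z "${INFOMAP_WORKING_DIR:-}" ]; then
        echo "INFOMAP_WORKING_DIR is not set" >&2
        return 2
    fi
    dir=$INFOMAP_WORKING_DIR/$1
    [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]
}

# Example:
# model_dir_ready many_01 && echo "many_01 looks ready to search"
```

A non-empty directory does not prove that the build succeeded, so searching the model with associate remains the real test.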

Installing an Infomap Model

Before installing a model it is generally a good idea to test the model by searching it using the associate program. See the section Searching Models below.

Models are installed using the infomap-install program. This program copies some of the model files created by infomap-build from the working directory to a more permanent model directory. There are at least three advantages to doing this:

Making models available to others. Installing useful models in a systemwide directory can make it much more convenient for other researchers to search those models using associate.
Keeping experimental models and known useful models separate. If you create all models initially under one directory, then copy those that prove useful to another directory, you can reduce the risk of accidentally overwriting a model you want to keep around. This will allow you to experiment with model creation with less fear of messing something up.
Conserving disk space. Some intermediate model files that are not needed for search are kept around by infomap-build because they might be useful for other forms of experimentation. Installing only the files needed for search and then deleting the working copy of a model saves disk space.

A default systemwide model directory is determined when the Infomap NLP software is installed. Typically this directory will be /usr/local/share/infomap-nlp/models. The default behavior of infomap-install is to install models as subdirectories of this directory.

For instance, imagine that we have built the sf model from the sf example corpus as described above, and that the INFOMAP_WORKING_DIR environment variable is set. Then the simple command

  $ infomap-install sf

will copy the sf model data directory and some of its contents from /home/jrandom/infomap_models to the systemwide model directory.

See the infomap-install(1) manual page for more options.

Searching Models

The associate command
Output of the associate command
How the associate command chooses which model to use
Retrieving documents

The `associate` command

Searching is done using the associate command. Details about the operation of this command can be found in the associate(1) man page. We describe common ways of using this command below.

Suppose that we want to search the sf model that we built in the example above, and that we have not yet installed this model using infomap-install. Imagine that we want to find words related to suits (the article of clothing, not lawsuits). Then we run the command

  $ associate -t -c sf suit NOT lawsuit

The suit NOT lawsuit part of this command is the query; it describes what we are looking for. The -c option tells which model to search (think "corpus" to remember this option). The argument to the -c option is the model tag, or name, of the model to be searched. The -t option tells associate to use a temporary working model, rather than an installed model; this means that associate looks for the model data directory in INFOMAP_WORKING_DIR.

How does associate find models that have been installed using infomap-install? It uses the INFOMAP_MODEL_PATH environment variable. This variable, similar in nature to the PATH and MANPATH variables, contains a list of directories, separated by colons, in which Infomap model data directories might be found. For instance, to look for Infomap models only in the systemwide data directory, you could issue the command

  $ INFOMAP_MODEL_PATH=/usr/local/share/infomap-nlp/models
  $ export INFOMAP_MODEL_PATH

or, under csh or tcsh,

  % setenv INFOMAP_MODEL_PATH /usr/local/share/infomap-nlp/models

With this path, the command

  $ associate -c sf suit NOT lawsuit

would perform the same search as above, but would expect to find the sf model in /usr/local/share/infomap-nlp/models (where it would be copied by infomap-install) rather than in /home/jrandom/infomap_models.
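
The idea of a colon-separated search path can be illustrated with a few lines of shell. This is only a sketch of the concept, not the code associate uses: find_model is a hypothetical helper, and taking the first match in path order is an assumption made by analogy with PATH.

```shell
# Illustrative sketch only: print the first directory named after the model
# tag that is found in the colon-separated list INFOMAP_MODEL_PATH.
find_model() {
    tag=$1
    old_ifs=$IFS
    IFS=:                       # split the path list on colons
    for dir in ${INFOMAP_MODEL_PATH:-}; do
        if [ -d "$dir/$tag" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$tag"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1                    # no directory on the path holds this model
}

# Example:
# INFOMAP_MODEL_PATH=/usr/local/share/infomap-nlp/models:/home/jrandom/models
# find_model sf
```

A helper like this can be handy for checking which copy of a model would be found when several directories are on the path.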

By default, associate retrieves words matching the query; using the -d option tells it to retrieve documents instead.

Output of the `associate` command

The associate command returns as its output a list of the words or documents best matching the query, in descending order of relevance. Each line of output consists of either a word or a document ID, followed by a colon, followed by a similarity score indicating how good a match for the query that word or document is judged to be. In the case of document retrieval, the document ID can then be used to obtain the document itself.
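
Because each output line has the form item, colon, score, the two parts are easy to separate with standard tools. The sketch below assumes that nothing follows the score on a line; result_items and result_scores are hypothetical helper names, and the sample data in the comment is made up.

```shell
# Hypothetical helpers that read associate-style "item:score" lines on stdin.

# Strip the final colon and everything after it, leaving the word or document ID.
result_items() {
    sed 's/:[^:]*$//'
}

# Keep only what follows the final colon, that is, the similarity score.
result_scores() {
    sed 's/^.*://'
}

# Example with made-up data:
# printf 'tie:0.81\njacket:0.77\n' | result_items
```

Splitting on the last colon rather than the first keeps items intact even if they contain a colon themselves. If the real output puts whitespace around the colon, the patterns would need adjusting.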

By default, associate will return the 20 words or documents that best match the query. The -n command-line option can be used to override this default and specify how many words or documents to return.

The -f command-line option will cause the matching words and documents, and the WordSpace vector for each, to be written to the specified file. This can be useful for further processing. (For instance, the Infomap Word Spectrum Plotter can use these files as a source of items and coordinates to plot.)

Retrieving documents with document IDs

Once we have run an associate command with the -d option to retrieve documents, then what? We have a list of "document IDs", but what do they mean, and how do we use them to get the documents themselves?

First of all, we should point out that document retrieval fetches documents from the original corpus, so the corpus must be in the same location and have exactly the same format as it did when we built the model in order for document retrieval to function reliably. If this is the case, we can use the print_doc command to look up documents from the document IDs and print them out.

The print_doc command is extremely simple. It accepts one or more document IDs on the command line, and it prints the document corresponding to each ID to standard output. If more than one document is printed, the documents are separated by a blank line, a line containing "===", and another blank line.
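
The separator makes it possible to process print_doc output with standard tools. As a small sketch, the hypothetical helper count_docs below counts the documents in such output; it assumes that the separator line consists of exactly "===" and that no document contains such a line itself.

```shell
# Hypothetical helper: count the documents in print_doc output read on stdin.
# Assumption: a separator line is exactly "===" and never occurs in a document.
count_docs() {
    awk '$0 == "===" { n++ }
         END { if (NR > 0) print n + 1; else print 0 }'
}

# Example with made-up text in place of real documents:
# printf 'first document\n\n===\n\nsecond document\n' | count_docs
```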

The print_doc program determines which model to use in the same way that associate does. That is, it takes a -c option whose argument is the model tag, and searches for a directory of that name in the INFOMAP_MODEL_PATH or, if the -t option has been given, in the INFOMAP_WORKING_DIR. If print_doc finds a model different from the one associate used to retrieve the results, then its results will be meaningless, so be careful.
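
One way to keep the two commands in step is to pass the same model tag to both from a single wrapper. The sketch below uses only the options described in this manual (-d and -c for associate, -c for print_doc); retrieve_docs is a hypothetical helper, and it assumes an installed model and output lines of the form ID, colon, score.

```shell
# Hypothetical helper: retrieve the documents matching a query and print them.
# Usage: retrieve_docs MODEL_TAG QUERY_WORD...
retrieve_docs() {
    tag=$1
    shift
    # The same tag goes to both programs, so both look up the same model name.
    associate -d -c "$tag" "$@" |
        sed 's/:[^:]*$//' |         # drop ":score", keeping the document ID
        xargs print_doc -c "$tag"   # pass all IDs to one print_doc call
}

# Example:
# retrieve_docs sf suit NOT lawsuit
```

For a model that has not been installed, the -t option would have to be added to both commands. Note also that some versions of xargs run the command once even when there are no IDs to pass.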

Document retrieval should be much easier with the Infomap web interface, which is planned for a future release.
