It is assumed that if you are reading this you have already figured out how to install and run the Infomap code. Instructions for doing this are given in the Infomap User Manual. This document is intended to give you a more thorough understanding of how the software is actually implemented; it should help you if you are planning to extend or modify the code, or if you are just curious about how it works.
The Infomap software has two fundamental functions: automatically learning a vector-space model of the words and documents in a corpus, and using the resulting model to perform search, in which queries are used to retrieve matching words or matching documents. We refer to the code that performs the former function, the learning, as the preprocessing code or preprocessing-side code, and to the code that performs the latter function, the search, as the search code or search-side code. Communication between the preprocessing-side code and the search-side code happens via a model, which is the object learned by the preprocessing-side code. In the current implementation, each model consists of all the files in a single directory, the model directory or model data directory.
The Infomap software distribution, once unpacked from its distributed form (as, e.g., a tar file), consists of a single top-level directory containing some files and some subdirectories. The preprocessing-side code is found in the preprocessing/ and svd/ directories; the search-side code is in the search/ directory.
Since a model is the means of communication between preprocessing-side and search-side code, the names of the files making up the model, the format of those files, and the semantics of each file, must be agreed upon by the preprocessing-side and search-side programs. If changes are made to the names or contents of the files making up a model, both the preprocessing-side and search-side code will need to be recompiled, possibly after being modified to account for the changes. Furthermore, models built before these changes may no longer be compatible with search-side code compiled after the changes.
On the other hand, so long as the format of a model is not changed, compatibility between the search-side code and preprocessing-side code should be maintained. This means, for one thing, that multiple models can be built from different corpora (or from the same corpus, using different parameters) simply by changing the appropriate variables in the top-level Makefile; these models can be learned and used for search without any recompilation. Also, the algorithm used for learning could be altered by making changes to the preprocessing-side code; so long as the new algorithm still produced a model in the same format, the search-side code would not need to be changed or recompiled. Likewise, new options or techniques could be added to the search-side code, and the modified code could still be used to search existing models.
All Infomap code refers to the files making up a model using compiled-in constants that are #define'd in the file filenames.h in the top-level directory. This file is automatically generated by make from the file Makefile.filenames, also in the top-level directory.
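For illustration, the definitions in filenames.h might look something like the following. The macro names and values shown here are hypothetical, invented for this sketch; the actual ones are determined by Makefile.filenames.

```c
/* Hypothetical sketch of the kind of definitions generated into
 * filenames.h from Makefile.filenames.  The real macro names and
 * values may differ; consult the generated file itself. */
#define WORDVEC_BIN_FILE  "wordvec.bin"   /* binary word vectors    */
#define ARTVEC_BIN_FILE   "artvec.bin"    /* binary article vectors */
#define WORD2OFFSET_FILE  "word2offset"   /* word -> offset DBM     */
#define DIC_FILE          "dic"           /* dictionary file        */
```

Because both the preprocessing-side and search-side programs compile against these same constants, a change to Makefile.filenames takes effect on both sides at the next recompilation.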
The learning of a model is accomplished by a collection of programs that run in series; each program produces output files that may be used as input files by later programs. The two tables below describe, respectively, the preprocessing-side programs, listing the input and output files for each and giving a brief description, and the files making up a model, listing the program that produces each and the programs that use it. These tables are intended as a guide to or framework for understanding the Infomap code.
Program | Input File(s) | Output File(s) | Description
---|---|---|---
prepare_corpus | corpus | wordlist, dic, numFiles, number2name (multi-file corpora only), corpus_format.bin | This program tokenizes the corpus and writes it in a format that is more easily manipulated by the other preprocessing-side programs. It also writes a "dictionary" file (dic) listing the words encountered in the corpus and the term and document frequency of each; the list is sorted in decreasing order of term frequency.
count_wordvec | wordlist, dic, numFiles | indx, coll, matlab (optional) | This program reads the tokenized corpus (wordlist) and associated dictionary (dic) and produces the co-occurrence matrix that will later be reduced by SVD. Part of this process is choosing which "content-bearing" words will be used as column labels. (For more about content-bearing words, see the algorithm description.) The co-occurrence matrix is output in a format suitable for use by SVDPACKC to perform SVD (the indx and coll files) and, optionally, in Matlab format (the matlab file) to enable hands-on manipulation.
../svd/svdinterface/svdinterface | coll, indx | left, rght, sing | The svdinterface program (which is in the svd/svdinterface/ subdirectory of the top-level Infomap directory, not in the preprocessing/ subdirectory like all of the other programs discussed here) is a relatively simple wrapper for the SVDPACKC library from the University of Tennessee. The co-occurrence matrix stored in the coll and indx files is reduced by SVD, producing the left singular vectors (left), right singular vectors (rght), and singular values (sing). The left singular vectors are treated as the WordSpace word vectors and will later be encoded by encode_wordvec into a more compact binary representation. The right singular vectors (in the file rght) may or may not be preserved upon completion of the svdinterface program, depending on a Makefile variable. For a principled explanation of SVD and its use by the Infomap software, see the algorithm description.
encode_wordvec | left | wordvec.bin, word2offset, offset2word | This program reads in the textual word vectors (in the file left) that were obtained by SVD in the svdinterface program, and writes them back out in a binary format to the file wordvec.bin. This is a more compact, although less portable, representation. (The raw C numeric data is simply written to disk; this is not portable due to differences in hardware and compilers.) To enable lookup, two DBM databases are created. The word2offset database allows the offset of a word's vector in the wordvec.bin file to be looked up using the word as a key; this is used by the search code to obtain the vectors for the words in a query (see the lookup sketch following this table). The offset2word database allows the inverse lookup: the offset of a word's vector in the wordvec.bin file can be used as a key to look up the word. This is used by the search code to determine which words it should return once it has determined which word vectors best match a query vector.
count_artvec | wordlist, wordvec.bin, word2offset | artvec.bin, art2offset, offset2art | This program uses the tokenized corpus (wordlist) and the word vectors already computed and stored in wordvec.bin to compute WordSpace vectors for the articles (documents) that make up the training corpus. These vectors can be used by the search code for information retrieval: those articles whose vectors most closely match the query vector are returned. A binary representation of the article vectors is written to artvec.bin after these vectors are computed. To enable lookup, two DBM databases are created. The art2offset database allows the offset of an article's vector in the artvec.bin file to be looked up using the article's ID as a key. The offset2art database allows the inverse lookup: the offset of an article's vector in artvec.bin can be used as a key to look up the article's ID. This is used by the search code to determine which articles it should return once it has determined which article vectors best match a query vector. In the case of single-file corpora, the article ID is the offset into the single corpus file at which the article begins; the end of the article is marked by a tag. In the case of multiple-file corpora, the article ID is an index into the number2name DBM, which maps it to the name of the file that consists entirely of the article. This scheme explains why multi-file corpora must have a strict one-to-one document-file correspondence.
write_text_params | model_params.bin, model_info.bin, corpus_format.bin | model_params.txt, model_info.txt, corpus_format.txt | This program translates the three binary parameter files into textual versions. This allows easy visual inspection of these objects, and could enable their transfer to another machine, since the binary versions are not portable across architectures.
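To make the relationship between word2offset and wordvec.bin concrete, here is a minimal sketch of a word-vector lookup using the classic ndbm interface. This is an illustration under stated assumptions, not the actual search-side implementation: the vector dimensionality, the storage of the offset as a raw long, and the inclusion of the terminating NUL in the key are all guesses made for the purposes of the sketch.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <ndbm.h>   /* classic DBM interface; on some systems <gdbm-ndbm.h> */

#define NUM_DIMS 100   /* hypothetical dimensionality; the real value is a
                          model parameter */

/* Fetch the vector for `word`: look up its offset in the word2offset DBM,
 * then seek to that offset in wordvec.bin and read the raw floats.
 * Returns 0 on success, -1 on failure. */
int fetch_word_vector(const char *model_dir, const char *word,
                      float vec[NUM_DIMS])
{
    char path[1024];
    datum key, val;
    long offset;
    DBM *db;
    FILE *fp;

    /* Open the word2offset database (physically word2offset.dir/.pag). */
    snprintf(path, sizeof(path), "%s/word2offset", model_dir);
    if ((db = dbm_open(path, O_RDONLY, 0)) == NULL)
        return -1;

    /* The word is the key; we assume the key includes the trailing NUL
     * and that the stored value is a raw long offset. */
    key.dptr = (char *) word;
    key.dsize = strlen(word) + 1;
    val = dbm_fetch(db, key);
    if (val.dptr == NULL) {      /* word not present in this model */
        dbm_close(db);
        return -1;
    }
    memcpy(&offset, val.dptr, sizeof(offset));
    dbm_close(db);

    /* Read NUM_DIMS raw floats starting at that offset. */
    snprintf(path, sizeof(path), "%s/wordvec.bin", model_dir);
    if ((fp = fopen(path, "rb")) == NULL)
        return -1;
    if (fseek(fp, offset, SEEK_SET) != 0 ||
        fread(vec, sizeof(float), NUM_DIMS, fp) != NUM_DIMS) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return 0;
}
```

The raw memcpy of the offset and raw fread of the floats are exactly the kind of byte-level encoding that, as noted above, makes the binary model files unportable across architectures.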
Filename | Produced By | Used By | Description
---|---|---|---
art2offset | count_artvec | search-side code | A DBM file; actually two files (art2offset.dir and art2offset.pag). Each key in this DB is an article ID; the corresponding value is the offset into artvec.bin at which the vector for that article can be found.
artvec.bin | count_artvec | search-side code | This file contains the WordSpace vectors for the articles (documents) in the corpus from which this model was built. To find the vector for a particular article, we look up that vector's offset using the article's ID as a key into the art2offset DBM database.
coll | count_wordvec | svdinterface | This is one of two files representing the co-occurrence matrix in a format that can be read by the SVD code. The other file is indx.
dic | prepare_corpus | count_wordvec | The dictionary. This text file lists all the types encountered in the corpus and the frequency of each. The types are listed one per line, sorted in decreasing order of frequency. Each line has the format term_freq doc_freq is_stop word, where word is the word whose dictionary entry this is; term_freq is the number of times that word occurred in the training corpus; doc_freq is the number of different documents (or articles) in the corpus in which word occurred; and is_stop is non-zero if word is a stopword (that is, if it appeared in the stoplist) and 0 otherwise. (A parsing sketch follows this table.)
indx | count_wordvec | svdinterface | This is one of two files representing the co-occurrence matrix in a format that can be read by the SVD code. The other file is coll.
left | svdinterface | encode_wordvec | The matrix of left singular vectors generated by SVD of the co-occurrence matrix. These vectors are treated as the word vectors.
matlab | count_wordvec | (none) | This file is a representation of the co-occurrence matrix computed by count_wordvec, in a format that can be used as input by MATLAB. (The same matrix is represented in a different format, used for SVD input, in the files coll and indx.) The generation of this file is optional and is controlled by the Makefile variable WRITE_MATLAB_FORMAT. It is not used by any of the Infomap software (either preprocessing-side or search-side), but you may find it useful for hands-on work in MATLAB, for instance to prototype new algorithms.
matrix | | |
model_params.bin | | write_text_params, search-side code | This file contains the parameters with which this model was built, in a binary format. (It is a raw MODEL_PARAMS structure.) This data can be manipulated using the functions declared in model_params.h. The file contains only those parameters that will be needed by search-side code; other parameters are instead stored in model_info.bin, so that model_params.bin will be small and fast to load.
model_params.txt | write_text_params | (none) | The same information as model_params.bin, in a human-readable (and more portable) textual format.
model_info.bin | | write_text_params | Additional model parameters, beyond those stored in model_params.bin.
model_info.txt | write_text_params | (none) | The same information as model_info.bin, in a human-readable (and more portable) textual format.
corpus_format.bin | prepare_corpus | write_text_params | Information about the input corpus format and how it was handled; for instance, the tags used to mark the beginning and end of documents, and the XML/SGML character entities that have been stripped.
corpus_format.txt | write_text_params | (none) | The same information as corpus_format.bin, in a human-readable (and more portable) textual format.
number2name | prepare_corpus | search-side code | A DBM database (thus two actual files, number2name.dir and number2name.pag). This database maps an article (document) number (ID) to the name of the file containing the corresponding article. It is relevant only to multiple-file corpora and is produced only for such corpora. Note that the files in a multiple-file corpus must each consist of exactly one corpus document (article).
numFiles | prepare_corpus | count_wordvec | The number of different files that make up the corpus.
offset2art | count_artvec | search-side code | A DBM database (thus two actual files, offset2art.dir and offset2art.pag). This database maps the offset of an article (document) vector in artvec.bin to that article's ID.
offset2word | encode_wordvec | search-side code | A DBM database (thus two actual files, offset2word.dir and offset2word.pag). This database maps the offset of a word vector in wordvec.bin to the word having that vector.
rght | svdinterface | (none) | The right singular vectors obtained during SVD. These are not used by the search-side code, and their retention is optional (controlled by the PRESERVE_RIGHT_SINGVECS variable in the Makefile). For an explanation of SVD and its use by the Infomap software, see the algorithm description.
sing | svdinterface | (none) | The singular values produced by SVD of the co-occurrence matrix. These are not currently used.
word2offset | encode_wordvec | count_artvec, search-side code | A DBM database (thus two actual files, word2offset.dir and word2offset.pag). This database maps a word to the offset of that word's vector in the wordvec.bin file.
wordlist | prepare_corpus | count_wordvec, count_artvec | The tokenized corpus: one token per line, plus some lines consisting of formatting information such as markers for the beginning and end of documents, and document ID numbers.
wordvec.bin | encode_wordvec | count_artvec, search-side code | This file contains the WordSpace vectors for the words in the corpus from which this model was built. To find the vector for a particular word, we look up that vector's offset in this file using the word as a key into the word2offset DBM database.
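As a concrete companion to the dic entry above, here is a minimal sketch of a dictionary reader. It assumes only what the table states: whitespace-separated fields in the order term_freq doc_freq is_stop word, one entry per line. The buffer sizes are arbitrary choices made for the sketch.

```c
#include <stdio.h>

/* Minimal sketch: read a dic file, whose lines have the format
 *     term_freq doc_freq is_stop word
 * and print each entry. */
int main(int argc, char *argv[])
{
    char line[1024], word[256];
    long term_freq, doc_freq;
    int is_stop;
    FILE *fp;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: %s <model_dir>/dic\n", argv[0]);
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "%ld %ld %d %255s",
                   &term_freq, &doc_freq, &is_stop, word) == 4)
            printf("%s: tf=%ld df=%ld%s\n", word, term_freq, doc_freq,
                   is_stop ? " (stopword)" : "");
    }
    fclose(fp);
    return 0;
}
```

Because the list is sorted in decreasing order of term frequency, the first lines printed are the most frequent types in the corpus.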