Infomap Implementation Guide

art2offset

A DBM database, stored as two files (art2offset.dir and art2offset.pag). Each key in this database is an article ID; the corresponding value is the offset into artvec.bin at which the vector for that article can be found.

artvec.bin

This file contains the WordSpace vectors for the articles (documents) in the corpus from which this model was built. To find the vector for a particular article, we look up that vector's offset using the article's ID in the art2offset DBM database.
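The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Infomap implementation: it assumes the stored offset is a byte offset, that each vector component is a 4-byte native-endian float, and that the vector dimension is known to the caller. None of these details is confirmed by this guide. A plain dict stands in for the art2offset database, and the names read_vector and demo are hypothetical.

```python
import struct
import tempfile
from pathlib import Path


def read_vector(offsets, vec_path, article_id, dim):
    """Return the `dim` floats stored at the offset recorded for article_id.

    Assumptions (not confirmed by this guide): offsets are byte offsets,
    and each component is a 4-byte native-endian float.
    """
    offset = int(offsets[article_id])
    with open(vec_path, "rb") as f:
        f.seek(offset)
        raw = f.read(4 * dim)
    if len(raw) != 4 * dim:
        raise ValueError(f"truncated vector at offset {offset}")
    return list(struct.unpack(f"={dim}f", raw))


def demo():
    # Build a tiny stand-in for artvec.bin holding two 3-dimensional vectors,
    # recording each vector's byte offset the way art2offset is described.
    vectors = {"1": [0.5, 0.25, 0.0], "2": [1.0, -1.0, 2.0]}
    offsets = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "artvec.bin"
        with open(path, "wb") as f:
            for art_id, vec in vectors.items():
                offsets[art_id] = f.tell()
                f.write(struct.pack("=3f", *vec))
        return read_vector(offsets, path, "2", dim=3)
```

On platforms where Python's dbm.ndbm module is available, it may be possible to pass an opened art2offset database in place of the dict, but that compatibility has not been verified here.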

coll

This is one of two files representing the co-occurrence matrix in a format that can be read by the SVD code. The other file is indx.

dic

The dictionary. This text file lists all the types encountered in the corpus and their frequency. The types are listed one per line, sorted in decreasing order of frequency. Each line has the format:

 term_freq doc_freq is_stop word

word is the word whose dictionary entry this is. term_freq is the number of times that word occurred in the training corpus. doc_freq is the number of different documents (or articles) in the corpus in which word occurred. is_stop has a non-zero value if word is a stopword (that is, if it appeared in the stoplist), and is 0 otherwise.
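A dictionary line in this format could be parsed along the following lines. This is a minimal sketch: it assumes the four fields are separated by whitespace, and the names DicEntry, parse_dic_line, and content_words, as well as the sample lines in the usage note, are hypothetical and not part of the Infomap software.

```python
from typing import NamedTuple


class DicEntry(NamedTuple):
    term_freq: int   # occurrences of the word in the training corpus
    doc_freq: int    # number of documents containing the word
    is_stop: bool    # True if the is_stop field is non-zero
    word: str


def parse_dic_line(line):
    # Split into at most four fields so the word field is kept whole.
    term_freq, doc_freq, is_stop, word = line.split(None, 3)
    return DicEntry(int(term_freq), int(doc_freq), int(is_stop) != 0, word.strip())


def content_words(lines):
    # Keep only entries that are not flagged as stopwords.
    entries = (parse_dic_line(line) for line in lines if line.strip())
    return [entry.word for entry in entries if not entry.is_stop]
```

For instance, content_words([" 1200 40 1 the", " 37 12 0 vector"]) keeps only the second entry, because the first is flagged as a stopword.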

indx

This is one of two files representing the co-occurrence matrix in a format that can be read by the SVD code. The other file is coll.

left

The matrix of left singular vectors generated by SVD of the co-occurrence matrix. These vectors are treated as the word vectors.

matlab

This file is a representation of the co-occurrence matrix that is computed by count_wordvec, in a format that can be used as input by MATLAB. (The same matrix is represented in a different format, used for SVD input, in the files coll and indx.)

The generation of this file is optional, and is controlled by the Makefile variable WRITE_MATLAB_FORMAT. It is not used by any of the Infomap software (either preprocessing-side or search-side), but you may find it useful for hands-on use in MATLAB, for instance to prototype new algorithms.

matrix

model_params.bin

This file contains the parameters with which this model was built, in a binary format. (It is a raw MODEL_PARAMS structure.) This data can be manipulated using the functions declared in model_params.h.

This file contains only those parameters that will be needed by search-side code. Other parameters are instead stored in model_info.bin, so that model_params.bin will be small and fast to load.

model_params.txt

The same information as model_params.bin, in a human-readable (and more portable) textual format.

model_info.bin

Additional model parameters, beyond those stored in model_params.bin.

model_info.txt

The same information as model_info.bin, in a human-readable (and more portable) textual format.

corpus_format.bin

Information about the input corpus format and how it was handled; for instance, the tags used to mark the beginning and end of documents, and XML/SGML character entities that have been stripped.

corpus_format.txt

The same information as corpus_format.bin, in a human-readable (and more portable) textual format.

number2name

This is a DBM database (thus two actual files, number2name.dir and number2name.pag). This database maps an article (document) ID number to the name of the file containing the corresponding article. It is relevant only to multiple-file corpora, and is produced only for such corpora.

Note that the files in a multiple-file corpus must each consist of exactly one corpus document (article).

numFiles

The number of different files that make up the corpus.

offset2art

This is a DBM database (thus two actual files, offset2art.dir and offset2art.pag). This database maps the offset of an article (document) vector in artvec.bin to the article ID.

offset2word

This is a DBM database (thus two actual files, offset2word.dir and offset2word.pag). This database maps the offset of a word vector in wordvec.bin to the word having that vector.

rght

The right singular vectors obtained during SVD. These are not used by the search-side code, and their retention is optional (and controlled by the PRESERVE_RIGHT_SINGVECS variable in the Makefile). For an explanation of SVD and its use by the Infomap software, see this file.

sing

The singular values produced by SVD of the co-occurrence matrix. These are not currently used.
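For orientation, the relationship between the left, sing, and rght files can be summarized with the standard form of a truncated singular value decomposition. The symbols A, U_k, \Sigma_k, V_k, and k are notation introduced here for illustration, not names taken from the Infomap sources, and the sketch assumes that the rows of the co-occurrence matrix correspond to words.

```latex
A \approx U_k \, \Sigma_k \, V_k^{\top}
% A        : the co-occurrence matrix (files coll and indx)
% U_k      : the k left singular vectors (file left)
% \Sigma_k : diagonal matrix of the k singular values (file sing)
% V_k      : the k right singular vectors (file rght)
```

Under that assumption, row i of U_k supplies the k-dimensional vector for word i, which is consistent with the left singular vectors being treated as the word vectors.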

word2offset

This is a DBM database (thus two actual files, word2offset.dir and word2offset.pag). This database maps a word to the offset of that word's vector in the wordvec.bin file.
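Because word2offset and offset2word are described as mappings in opposite directions, one simple sanity check is that they are inverses of each other. The sketch below illustrates that check with plain dicts standing in for the two DBM databases. The function name check_inverse is hypothetical, and how keys and values are actually encoded in the databases is not specified in this guide.

```python
def check_inverse(word2offset, offset2word):
    """Return the words whose offset does not map back to the same word."""
    mismatches = []
    for word in word2offset.keys():
        offset = word2offset[word]
        # A word fails the check if its offset is missing from the reverse
        # mapping or points at a different word.
        if offset not in offset2word or offset2word[offset] != word:
            mismatches.append(word)
    return mismatches
```

An empty result means every word's offset maps back to that word; any word returned points to an inconsistency between the two mappings.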

wordlist

The tokenized corpus: one token per line, plus some lines consisting of formatting information like markers for the beginning and end of documents, and document ID numbers.

wordvec.bin