Corpus Input Formats for the Infomap Software

The Infomap software can process corpora in two input formats: single-file and multiple-file. A single-file corpus consists of one file that contains the entire corpus; the documents making up the corpus are delimited by XML-like begin-document and end-document tags. A multiple-file corpus has one file per document. Both formats are described in more detail below.

Single-file corpora

In a single-file corpus, the entire corpus (consisting of one or more documents) is contained in a single file on disk. The beginning of each document is marked by a <DOC> tag, and the end of each document by a </DOC> tag. Within each document, the document's text (as opposed to header information such as a title or document number) appears between a <TEXT> tag and a </TEXT> tag. Everything outside these text tags is ignored by the Infomap software.
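
To make the tag layout concrete, the following Python sketch builds a small sample corpus and pulls out the text that falls between the <TEXT> and </TEXT> tags of each document. It is only an illustration and not the Infomap software's own parser. The <DOCNO> and <TITLE> header tags, the sample sentences, and the names SAMPLE_CORPUS and extract_texts are all made up for this example. The sketch also assumes that tags appear literally in upper case and are not nested, and that a document with more than one <TEXT> section should have those sections joined; none of this is specified above.

```python
import re

# Hypothetical sample corpus. The DOCNO and TITLE header tags are
# illustrative only; anything outside <TEXT>...</TEXT> is ignored.
SAMPLE_CORPUS = """<DOC>
<DOCNO> 1 </DOCNO>
<TITLE> First document </TITLE>
<TEXT>
The quick brown fox.
</TEXT>
</DOC>
<DOC>
<DOCNO> 2 </DOCNO>
<TEXT>
Jumps over the lazy dog.
</TEXT>
</DOC>
"""

# Non-greedy matches so each pattern stops at the nearest closing tag.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def extract_texts(corpus):
    """Return one string per <DOC>, holding only its <TEXT> content."""
    texts = []
    for doc_body in DOC_RE.findall(corpus):
        # Assumption: multiple <TEXT> sections in one document are joined.
        sections = [s.strip() for s in TEXT_RE.findall(doc_body)]
        texts.append("\n".join(sections))
    return texts


if __name__ == "__main__":
    for number, text in enumerate(extract_texts(SAMPLE_CORPUS), start=1):
        print(number, text)
```

In this sample, the document numbers and the title are dropped because they lie outside the text tags, which mirrors the behavior described above.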

Multiple-file corpora

In a multiple-file corpus, each disk file that is part of the corpus must contain exactly one document. No tags are used; the entire contents of the file are considered to make up the text of the document and are processed by the Infomap software.
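
Because the two formats differ only in how documents are delimited, a multiple-file corpus can in principle be rewritten as a single-file corpus by wrapping each file's contents in the tags described earlier. The Python sketch below shows one way this might be done. The function names wrap_documents and combine_files, the UTF-8 encoding, and the placement of each tag on its own line are assumptions made for this example, not requirements stated by the Infomap documentation. The sketch also assumes that no document contains the literal tag strings in its text.

```python
from pathlib import Path


def wrap_documents(texts):
    """Wrap each document text in <DOC> and <TEXT> tags and concatenate."""
    pieces = []
    for text in texts:
        # Assumption: one tag per line; this layout is a choice made here.
        pieces.append("<DOC>\n<TEXT>\n" + text.strip() + "\n</TEXT>\n</DOC>\n")
    return "".join(pieces)


def combine_files(paths, output_path, encoding="utf-8"):
    """Read one document per file and write a single-file corpus.

    `paths` is an iterable of file paths, one document each.
    The encoding is an assumption; adjust it to match the corpus.
    """
    texts = [Path(p).read_text(encoding=encoding) for p in paths]
    Path(output_path).write_text(wrap_documents(texts), encoding=encoding)


if __name__ == "__main__":
    # In-memory demonstration; no files are read or written here.
    print(wrap_documents(["First document.", "Second document."]), end="")
```

For a real corpus, combine_files could be called with a sorted list of paths, for example sorted(Path("corpus_dir").glob("*.txt")) for a hypothetical directory named corpus_dir, so that the document order in the output is reproducible.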