Infomap Algorithm Description

The Infomap software in this package takes a corpus of text documents and builds a WORDSPACE in which the words in the corpus are represented by word vectors. A word vector is a list of numbers (called coordinates) which encodes information about how the word is distributed over the corpus. Many experiments have demonstrated that words with similar word vectors often have similar or closely related meanings: the Infomap WORDSPACE can therefore be used to model similarities between different words by automatically comparing their behavior in text.

The main algorithms used in the software are for building cooccurrence matrices, concentrating information by reducing the number of dimensions, comparing word vectors using cosine similarity, and carrying out logical operations (so far these include negation and negated disjunction). These steps are described in turn below.


Building cooccurrence matrices

Many information retrieval systems start by building a term-document matrix, which is a large table with one row for each term and one column for each document: each number then records how many times a particular word occurred in a particular document. This gives each word a list of numbers, and this list is called a word vector. A good way to think of these numbers is as "meaning-coordinates", just as latitude and longitude associate spatial coordinates with points on the surface of the earth. Each word, then, is assigned a list of coordinates which determines a point in some (often high-dimensional) vector space. Readers who are not completely happy with the language of vectors might find an introductory chapter on vectors useful.
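
As a concrete illustration (not taken from the Infomap package itself, which is written in C), the following Python sketch builds a tiny term-document count matrix; the two toy documents are invented for the example.

    # Illustrative sketch: build a term-document count matrix for a toy corpus.
    # Rows are terms, columns are documents.
    import numpy as np

    documents = [
        "the umpire called the runner out at home plate",
        "the referee blew the whistle and stopped the match",
    ]

    # Vocabulary: every distinct word in the corpus, in a fixed order.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    row_of = {word: i for i, word in enumerate(vocabulary)}

    # counts[i, j] = number of times vocabulary[i] occurs in documents[j].
    counts = np.zeros((len(vocabulary), len(documents)), dtype=int)
    for j, doc in enumerate(documents):
        for word in doc.split():
            counts[row_of[word], j] += 1

    # The i-th row of `counts` is the word vector for vocabulary[i].
    print(vocabulary)
    print(counts)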

When studying words and their properties, term-document matrices are not ideal, because many similar words are rarely used in the same document: for example, reports of sporting events often mention an umpire or a referee but only rarely use both of these words in the same article, making it difficult to work out that these words are very similar. The Infomap software addresses this by choosing a number of special content-bearing words and assigning coordinates to the other words based upon how often they occur near one of these content-bearing words. This is best illustrated by an example:

HOT-FROM-THE-OVEN MEALS: Keep hot food HOT; warm isn't good enough. Set the oven temperature at 140 degrees or hotter. Use a meat thermometer. And cover with foil to keep food moist. Eat within two hours. "Change is always happening," said the ebullient trumpeter, whose words tumble out almost as fast as notes from his trumpet. "That's one of the wonderful things about jazz music." For many jazz fans, Ferguson is one of the wonderful things about jazz music.
The words music and food are good candidates for content words: they are normally unambiguous and seem to have clear meanings in terms of which other words can be defined. (Though one can almost always find potentially confusing uses such as "If music be the food of love, play on".) The above example would begin to give us the following count data:
            eat   hot   jazz   meat   trumpet
    Music    -     -     3      -       1
    Food     1     2     -      1       -

We proceed through the whole corpus of documents like this, for each word building up a "number signature" which tells us how often that word appeared within a window of text near each of the content-bearing words. The size of this window of text can be altered, as can the choice of content-bearing words. (A relatively stable method has been to choose as content words the 1000 most frequent words in a document collection after stopwords (such as the, we and of) have been removed.) In this way, the WORDSPACE software builds a large cooccurrence matrix in which the columns are content-bearing words and each row records the number of times a given word occurred in the same region of text as a particular content-bearing word.
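
As a rough sketch of this windowing step (again in Python rather than the package's own C, with an invented toy corpus, a window of five words and just two content words), one might count cooccurrences as follows.

    # Illustrative sketch: count how often each word occurs within a fixed
    # window of text around chosen content-bearing words.
    import numpy as np

    content_words = ["music", "food"]           # columns of the matrix
    window = 5                                  # words of context on each side

    corpus = ("keep hot food hot set the oven at 140 degrees use a meat "
              "thermometer eat within two hours the trumpeter said that is "
              "one of the wonderful things about jazz music").split()

    vocabulary = sorted(set(corpus))
    row_of = {w: i for i, w in enumerate(vocabulary)}
    col_of = {w: j for j, w in enumerate(content_words)}

    cooc = np.zeros((len(vocabulary), len(content_words)), dtype=int)
    for pos, word in enumerate(corpus):
        if word in col_of:                      # found a content word
            lo, hi = max(0, pos - window), pos + window + 1
            for neighbour in corpus[lo:pos] + corpus[pos + 1:hi]:
                cooc[row_of[neighbour], col_of[word]] += 1

    # cooc[i, j] = how often vocabulary[i] occurred near content_words[j].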

It is at this stage of the preprocessing that the WORDSPACE software can incorporate extra linguistic information, such as part-of-speech tags and multiword expressions, if these are suitably recorded in the corpus.


[Figure: a latent dimension combining the car and drive coordinate axes]

Reducing dimensions

The number of dimensions of any vector space is defined to be the number of coordinates needed to represent each point, so for a table with 1000 numbers in each row, each row represents a point in a 1000-dimensional space. This is often too many dimensions: the points have too much space in which to spread out, and the resulting information is very sparse. There are many ways of concentrating information into a smaller space by reducing the number of dimensions. For example, the diagram above shows several words that occur in the contexts of cars and driving. Since these are often versions of the same underlying context, it makes sense to replace the coordinate axes for the words car and drive with a single axis which combines both.

The technique normally used by the Infomap WORDSPACE software is called singular value decomposition, which has been used in information retrieval to reduce the sparseness of standard term-document matrices. This process is often called latent semantic indexing or latent semantic analysis. Singular value decomposition is only one possibility for reducing the number of dimensions of a dataset: others include probabilistic latent semantic analysis and locally linear embedding.

When using the WORDSPACE software at Stanford we have used Mike Berry's SVDPACKC to calculate the singular value decomposition. The licensing for SVDPACKC is not the same as that for the WORDSPACE software itself, and if you use SVDPACKC you must make sure to obtain the correct licenses. The number of dimensions to reduce to is another parameter which can be readily altered: we have had good results with 100 dimensions, while other researchers have found that somewhere between 200 and 300 works best. As with many things, it is reasonable to assume that "best" is partly determined by the task at hand, and that the question of how many dimensions are needed to represent meaning has many answers in different situations. These vectors are produced using the programs in the preprocessing directory.
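
The package itself delegates this computation to SVDPACKC, but the idea can be sketched with numpy's built-in SVD; the matrix sizes and the choice of k = 100 below are assumptions made for the example.

    # Illustrative sketch of the dimension-reduction step using numpy's SVD
    # (the real package delegates this computation to SVDPACKC).
    import numpy as np

    def reduce_dimensions(cooc, k=100):
        """Project the rows of a cooccurrence matrix into k dimensions."""
        # Economy-size SVD: cooc = U @ diag(s) @ Vt
        U, s, Vt = np.linalg.svd(cooc, full_matrices=False)
        k = min(k, len(s))
        # Scale the first k left singular vectors by the singular values;
        # each row of the result is a k-dimensional word vector.
        return U[:, :k] * s[:k]

    # Random count data stands in here for a real cooccurrence matrix.
    rng = np.random.default_rng(0)
    cooc = rng.poisson(1.0, size=(2000, 500)).astype(float)
    word_vectors = reduce_dimensions(cooc, k=100)   # shape (2000, 100)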


Comparing word vectors using cosine similarity

Now that each word is represented by a vector with a suitable number of reduced dimensions, we can begin to compare words with one another to find out if they are similar or different. One standard way to do this is to use the cosine similarity measure. Let the vector a have coordinates (a_1, ..., a_n) and let the vector b have coordinates (b_1, ..., b_n). Their scalar product is defined to be the sum

a.b = a_1 b_1 + ... + a_n b_n,

and if we divide this by the moduli ||a|| and ||b||, we obtain the cosine of the angle between the two vectors, which is called their cosine similarity:

cos(a, b) = (a_1 b_1 + ... + a_n b_n) / (||a|| ||b||).

Again, these operations are described in much more detail in the introduction to vectors. This enables us to find the nearest neighbors of a given word (those with the highest cosine similarity) using the associate program in the search directory.
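
A small Python sketch of the cosine computation and a brute-force nearest-neighbour search in the spirit of the associate program (the function names are ours, not the program's):

    # Illustrative sketch of cosine similarity and nearest-neighbour search.
    import numpy as np

    def cosine(a, b):
        """cos(a, b) = a.b / (||a|| ||b||)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def nearest_neighbours(word, word_vectors, vocabulary, n=10):
        """Return the n words whose vectors have the highest cosine with `word`."""
        target = word_vectors[vocabulary.index(word)]
        sims = [(cosine(target, v), w) for w, v in zip(vocabulary, word_vectors)]
        sims.sort(reverse=True)
        # Drop the query word itself and keep the n best remaining matches.
        return [(w, s) for s, w in sims[:n + 1] if w != word][:n]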

One of the great benefits of the vector formalism is that it allows us to combine words into sentences or documents by adding their vectors together. If article vectors have been built during the preprocessing phase, the associate program can also be used to find nearby documents and thus for information retrieval.

Schütze sometimes calls such composite vectors context vectors, because they gather information from the context surrounding a particular word. Context vectors can be clustered using a variety of clustering algorithms, and the centroids of the different clusters can be used to represent different senses of a word, giving sense vectors.
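
A sketch of how context vectors and sense vectors might be computed; summing word vectors follows the description above, while the use of scikit-learn's k-means is our own stand-in for whichever clustering algorithm a particular experiment uses.

    # Illustrative sketch: context vectors as sums of word vectors, and
    # sense vectors as centroids of clustered context vectors.
    import numpy as np
    from sklearn.cluster import KMeans   # any clustering algorithm could be used

    def context_vector(context_words, word_vectors, vocabulary):
        """Add together the vectors of the words in one context window."""
        rows = [word_vectors[vocabulary.index(w)] for w in context_words
                if w in vocabulary]
        return np.sum(rows, axis=0)

    def sense_vectors(context_vectors, n_senses=2):
        """Cluster context vectors; each cluster centroid is one sense vector."""
        kmeans = KMeans(n_clusters=n_senses, n_init=10).fit(np.asarray(context_vectors))
        return kmeans.cluster_centers_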


Quantum connectives in WORDSPACE

Though very successful for many goals including language acquisition and information retrieval, the vector model to date has suffered from serious theoretical drawbacks which researchers are only just beginning to address. The WORDSPACE we have described so far is a pretty blunt instrument: the only operations on vectors we have introduced are addition and the scalar product, neither of which is affected by the order in which the vectors appear (by contrast with real language, in which Cain killed Abel and Abel killed Cain have completely different meanings). To begin with, we would like to distinguish between the different logical connectives: searches for A NOT B, A OR B and A AND B should not give the same results. This turns out to be quite simple, at least for negation, where the equation

a NOT b = a - (a.b)b

can be shown to give a vector for the expression "a NOT b" which has zero cosine similarity with the unwanted vector b (provided that b is normalized to unit length). Vectors with zero cosine similarity are said to be orthogonal to one another, and so the region of WORDSPACE corresponding to the notion "NOT b" is the subspace of points which are orthogonal to b. Similarly, taking the plane spanned by the vectors a and b gives an expression "a OR b" which is consistent with the idea of negation via orthogonality. These operations have demonstrated interesting strengths (and interesting weaknesses) when compared with the traditional Boolean connectives in document retrieval experiments.
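
A minimal sketch of this negation operation, normalizing b so that the orthogonality property holds:

    # Illustrative sketch of quantum negation: remove from a its component
    # along b, so that the result has zero cosine similarity with b.
    import numpy as np

    def negate(a, b):
        """Return a vector for "a NOT b": a minus its projection onto b."""
        b_hat = b / np.linalg.norm(b)   # the formula assumes b is a unit vector
        return a - np.dot(a, b_hat) * b_hat

    a = np.array([2.0, 1.0, 0.0])
    b = np.array([1.0, 0.0, 0.0])
    print(negate(a, b))                 # [0. 1. 0.], orthogonal to b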

It turns out that precisely the same logical operators on vectors were used by Birkhoff and von Neumann in the 1930s to describe the logic of a quantum mechanical system, which is why the logical operators are called the quantum connectives and the system as a whole is called quantum logic.

The WORDSPACE software currently implements versions of quantum negation and negated disjunction (which for computational reasons turns out to be much more tractable than a positive disjunction).
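
Read this way, negated disjunction a NOT (b OR c OR ...) amounts to projecting a onto the orthogonal complement of the subspace spanned by the unwanted vectors. A sketch under that reading (the helper name is ours):

    # Illustrative sketch of negated disjunction: remove from a every
    # component lying in the subspace spanned by the unwanted vectors.
    import numpy as np

    def negate_all(a, unwanted):
        """Return a vector for "a NOT (b OR c OR ...)"."""
        # Orthonormal basis for the subspace spanned by the unwanted vectors.
        Q, _ = np.linalg.qr(np.column_stack(unwanted))
        # Subtract the projection of a onto that subspace.
        return a - Q @ (Q.T @ a)

    a = np.array([1.0, 1.0, 1.0])
    b = np.array([1.0, 0.0, 0.0])
    c = np.array([0.0, 1.0, 0.0])
    print(negate_all(a, [b, c]))        # [0. 0. 1.], orthogonal to both b and c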


The development of this software and the underlying methods has received contributions from several researchers during different phases of the Infomap project at the Computational Semantics Laboratory, under the guidance of Stanley Peters. Hinrich Schütze pioneered many of the original methods, using the WORDSPACE model for word sense discrimination. Stefan Kaufmann was responsible for writing a new version of the software on which the current release is largely based. Dominic Widdows added logical connectives and incorporated other linguistic information, including part-of-speech tags and multiword expressions. The current version for public release has been managed by Scott Cederberg. Contributions and experiments from several other researchers are described in our Papers.


References