The Infomap software in this package takes a corpus of text documents and builds a WORDSPACE in which the words in the corpus are represented by word vectors. A word vector is a list of numbers (called coordinates) which encodes information about how the word is distributed over the corpus. Many experiments have demonstrated that words with similar word vectors often have similar or closely related meanings: the Infomap WORDSPACE can therefore be used to model similarities between different words by automatically comparing their behavior in text.
The main algorithms used in the software are for building cooccurrence matrices, concentrating information by reducing the number of dimensions, comparing word vectors using cosine similarity, and carrying out logical operations (so far these include negation and negated disjunction). These steps will be described in turn.
Many information retrieval systems start by building a term-document matrix, which is a large table with one row for each term and one column for each document: each entry records the number of times a particular word occurred in a particular document. This gives each word a list of numbers, and this list is called a word vector. A good way to think of these numbers is as "meaning-coordinates", just as latitude and longitude associate spatial coordinates with points on the surface of the earth. Each word, then, is assigned a list of coordinates which determines a point in some (often high-dimensional) vector space. Readers who are not completely happy with the language of vectors might find an introductory chapter useful.
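As a concrete illustration, here is a minimal sketch of building a term-document count matrix for a toy two-document corpus. The corpus, the whitespace tokenization, and all variable names are illustrative assumptions, not part of the Infomap code.

```python
# Minimal sketch: build a term-document count matrix from a toy corpus.
from collections import Counter

docs = [
    "the umpire stopped the match",
    "the referee stopped the game",
]

vocab = sorted({w for d in docs for w in d.split()})
row_of = {w: i for i, w in enumerate(vocab)}

# matrix[i][j] = number of times term i occurs in document j
matrix = [[0] * len(docs) for _ in vocab]
for j, d in enumerate(docs):
    for w, n in Counter(d.split()).items():
        matrix[row_of[w]][j] = n

# The row for a word is its word vector, e.g. "umpire" -> [1, 0]
print("umpire:", matrix[row_of["umpire"]])
```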
When studying words and their properties, term-document matrices are not ideal, because many similar words are rarely used in the same document: for example, reports of sporting events often mention an umpire or a referee but only rarely use both of these words in the same article, making it difficult to work out that the two words are very similar. The Infomap software addresses this by choosing a number of special content-bearing words and assigning coordinates to the other words based upon how often they occurred near one of these content-bearing words. This is best illustrated by an example:
Document 1: HOT-FROM-THE-OVEN MEALS: Keep hot food HOT; warm isn't good enough. Set the oven temperature at 140 degrees or hotter. Use a meat thermometer. And cover with foil to keep food moist. Eat within two hours.

Document 2: "Change is always happening," said the ebullient trumpeter, whose words tumble out almost as fast as notes from his trumpet. "That's one of the wonderful things about jazz music." For many jazz fans, Ferguson is one of the wonderful things about jazz music.

| content-bearing word | eat | hot | jazz | meat | trumpet |
|----------------------|-----|-----|------|------|---------|
| music                |     |     | 3    |      | 1       |
| food                 | 1   | 2   |      | 1    |         |
We proceed through the whole corpus of documents like this, building up for each word a "number signature" which tells us how often that word appeared within a window of text near each of the content-bearing words. The size of this window of text can be altered, as can the choice of content-bearing words. (A relatively stable method has been to choose as content words the 1000 most frequent words in a document collection after stopwords such as the, we, and of have been removed.) In this way, the WORDSPACE software builds a large cooccurrence matrix in which the columns are content-bearing words and each row records the number of times a given word occurred in the same region of text as a particular content-bearing word.
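The counting step can be pictured with a short sketch like the one below. The token list, the window size of 5, and the two content-bearing words are illustrative placeholders; in the real preprocessing programs the window size and the content-word list are configurable parameters, as described above.

```python
# Sketch of window-based cooccurrence counting against content-bearing words.
from collections import defaultdict

content_words = ["music", "food"]   # columns of the cooccurrence matrix
window = 5                          # number of tokens on each side

tokens = ("keep hot food hot use a meat thermometer eat within two hours "
          "jazz fans love jazz music and the trumpet").split()

counts = defaultdict(lambda: defaultdict(int))   # word -> content word -> count
for i, word in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
        if neighbour in content_words:
            counts[word][neighbour] += 1

# Each word's "number signature" over the content-bearing words:
for word in ("hot", "meat", "trumpet"):
    print(word, [counts[word][c] for c in content_words])
```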
It is at this stage of the preprocessing that the WORDSPACE software can incorporate extra linguistic information such as part of speech tags and multiword expressions, if these are suitably recorded in the corpus.
The technique normally used by the Infomap WORDSPACE software to reduce the number of dimensions is singular value decomposition (SVD), which has been used in information retrieval to reduce the sparseness of standard term-document matrices; this process is often called latent semantic indexing or latent semantic analysis. Singular value decomposition is only one possibility for reducing the number of dimensions of a dataset: others include probabilistic latent semantic analysis and locally linear embedding.
When using the WORDSPACE software at Stanford, we have used Mike Berry's SVDPACKC to compute the singular value decomposition. The licensing for SVDPACKC is not the same as that for the WORDSPACE software itself, so if you use SVDPACKC you must make sure to obtain the correct licenses. The number of dimensions to reduce to is another parameter which can be readily altered: we have had good results with 100 dimensions, while other researchers have found that somewhere between 200 and 300 works best. As with many things, it is reasonable to assume that "best" is partly determined by the task at hand and that the question of how many dimensions are needed to represent meaning has different answers in different situations. These vectors are produced using the programs in the preprocessing directory.
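For illustration only, the following sketch reduces a randomly generated word-by-content-word count matrix to 100 dimensions using a truncated SVD, with numpy standing in for SVDPACKC. Taking the scaled left singular vectors as the reduced word vectors is a common latent-semantic-analysis convention and is assumed here; it is not necessarily the exact recipe used by the Infomap preprocessing programs.

```python
# Sketch: truncated SVD of a cooccurrence matrix (numpy in place of SVDPACKC).
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(2000, 500)).astype(float)   # words x content-bearing words

k = 100                                  # target number of dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reduced word vectors: one k-dimensional row per word (LSA-style U * s).
word_vectors = U[:, :k] * s[:k]
print(word_vectors.shape)                # (2000, 100)
```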
One of the great benefits of the vector formalism is that it allows us to combine words into sentences or documents by adding their vectors together. If article vectors have been built during the preprocessing phase, the associate program can also be used to find nearby documents and thus for information retrieval.
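A sketch of this idea, with randomly generated stand-ins for the word and article vectors that the preprocessing phase would produce: the query vector is the sum of its word vectors, and articles are ranked by cosine similarity. This illustrates the principle rather than the actual behavior of the associate program.

```python
# Sketch: compose a query by vector addition, rank articles by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
word_vec = {w: rng.normal(size=100) for w in ("jazz", "trumpet", "oven")}
doc_vecs = rng.normal(size=(3, 100))             # one vector per article

query = word_vec["jazz"] + word_vec["trumpet"]   # vector sum of the query words
ranking = sorted(range(len(doc_vecs)),
                 key=lambda j: cosine(query, doc_vecs[j]),
                 reverse=True)
print("articles ranked by similarity:", ranking)
```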
Schütze sometimes calls such composite vectors context vectors because they gather information from the context surrounding a particular word. Context vectors can be clustered using a variety of clustering algorithms, and the centroids of the different clusters can be used to represent the different senses of a word, giving sense vectors.
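The following sketch induces sense vectors by clustering synthetic context vectors for a single ambiguous word. k-means is used here purely as one example of the "variety of clustering algorithms" mentioned above; the data and the choice of two senses are assumptions made for the illustration.

```python
# Sketch: cluster context vectors and keep cluster centroids as sense vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Synthetic context vectors for occurrences of one ambiguous word,
# drawn from two different regions of the space.
contexts = np.vstack([rng.normal(loc=+1.0, size=(50, 100)),
                      rng.normal(loc=-1.0, size=(50, 100))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(contexts)
sense_vectors = km.cluster_centers_      # one centroid per induced word sense
print(sense_vectors.shape)               # (2, 100)
```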
In the WORDSPACE model, logical operations on word vectors are modeled using orthogonality and projection onto subspaces. It turns out that precisely the same logical operators on vectors were used by Birkhoff and von Neumann in the 1930s to describe the logic of a quantum mechanical system, which is why these operators are called the quantum connectives and the system as a whole is called quantum logic.
The WORDSPACE software currently implements versions of quantum negation and negated disjunction (which for computational reasons turns out to be much more tractable than a positive disjunction).
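Published descriptions of quantum negation model "a NOT b" as the projection of a onto the subspace orthogonal to b, and "a NOT (b OR c)" as the projection of a orthogonal to the subspace spanned by b and c. The sketch below shows that arithmetic with placeholder vectors; it is an illustration of the operation, not the WORDSPACE implementation itself.

```python
# Sketch of vector negation by projection onto an orthogonal subspace.
import numpy as np

def negate(a, *negs):
    """Remove from a its components along an orthonormal basis of span(negs)."""
    result = a.astype(float).copy()
    basis = []
    for v in negs:
        v = v.astype(float).copy()
        for b in basis:                  # Gram-Schmidt against earlier vectors
            v -= (v @ b) * b
        norm = np.linalg.norm(v)
        if norm > 1e-12:
            basis.append(v / norm)
    for b in basis:
        result -= (result @ b) * b
    return result

a = np.array([1.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(negate(a, b))                      # [1. 0. 0.]: the part of a orthogonal to b
```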
The development of this software and the underlying methods has received contributions from several researchers during different phases of the Infomap project at the Computational Semantics Laboratory under the guidance of Stanley Peters. Hinrich Schütze pioneered many of the original methods, using the WORDSPACE model for word sense discrimination. Stefan Kaufmann was responsible for writing a new version of the software on which the current release is largely based. Dominic Widdows added logical connectives and incorporated other linguistic information including part of speech tags and multiword expressions. The current version for public release has been managed by Scott Cederberg. Contributions and experiments from several other researchers are described in our Papers.