An introductory presentation to word space models and the Random Indexing algorithm in text mining
Word Space Models and
Random Indexing
By Dileepa Jayakody
Overview
● Text Similarity
● Word Space Model
– Distributional hypothesis
– Distance and Similarity measures
– Pros & Cons
– Dimension Reduction
● Random Indexing
– Example
– Random Indexing Parameters
– Data pre-processing in Random Indexing
– Random Indexing Benefits and Concerns
Text Similarity
● Human readers judge the similarity between texts by comparing their abstract meaning, i.e. whether they discuss a similar topic
● How can meaning be modelled in a program?
● In the simplest approach, if two texts contain the same words, they are assumed to have similar meanings
Meaning of a Word
● The meaning of a word can be determined by the context formed by the surrounding words
● E.g. the meaning of the word “foobar” is determined by the words that co-occur with it, e.g. "drink", "beverage" or "sodas"
– He drank the foobar at the game.
– Foobar is the number three beverage.
– A case of foobar is cheap compared to other sodas.
– Foobar tastes better when cold.
● A co-occurrence matrix represents the context vectors of words/documents
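As a sketch, the co-occurrence counts behind such context vectors can be accumulated like this (the sentences and window size are illustrative choices, not part of the slides):

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words co-occurs within `window` positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

m = cooccurrence_matrix([["he", "drank", "the", "foobar"],
                         ["foobar", "is", "a", "beverage"]])
print(m["foobar"]["drank"])  # → 1
```

Each row of `counts` is one word's context vector over the vocabulary.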
Word Space Model
● The word-space model is a computational model of meaning to represent similarity between words/text
● It derives the meaning of words by plotting the words in an n-dimensional geometric space
Word Space Model
● The dimensions in word-space n can be arbitrarily large
(word * word | word * document)
● The coordinates used to plot each word depend on the frequency of the contextual features that the word co-occurs with in a text
● e.g. words that do not co-occur with the word to be plotted within a given context are assigned a coordinate value of zero
● The set of zero and non-zero values corresponding to the coordinates of a word in a word-space are recorded in a context vector
Distributional Hypothesis in Word Space
● To deduce a certain level of meaning, the coordinates of a word need to be measured relative to the coordinates of other words
● The linguistic concept known as the distributional hypothesis states that “words that occur in the same contexts tend to have similar meanings”
● The level of closeness of words in the word-space is called the spatial proximity of words
● Spatial proximity represents the semantic similarity of words in word space models
Distance and Similarity Measures
● Cosine Similarity
(A common approach used to determine spatial proximity by measuring the cosine of the angle between the plotted context vectors of the text)
● Other measures
– Euclidean
– Lin
– Jaccard
– Dice
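A minimal, library-free sketch of the cosine measure; the example vectors are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two context vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

print(cosine([3, 4], [6, 8]))  # → 1.0 (parallel: maximally similar)
print(cosine([1, 0], [0, 1]))  # → 0.0 (orthogonal: no similarity)
```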
Word Space Models
● Latent Semantic Analysis (document based co-occurrence : word * document)
● Hyperspace Analogue to Language (word based co-occurrence : word * word)
● Latent Dirichlet Allocation
● Random Indexing
Word Space Model Pros & Cons
● Pros
– Mathematically well defined model allows us to define semantic similarity in mathematical terms
– Constitutes a purely descriptive approach to semantic modeling; it does not require any previous linguistic or semantic knowledge
● Cons
– Efficiency and scalability problems with the high dimensionality of the context vectors
– Majority of the cells in the matrix will be zero due to the sparse data problem
Dimension Reduction
● Singular Value Decomposition
– a matrix factorization technique that decomposes and approximates a matrix, so that the resulting matrix has far fewer dimensions while preserving most of the original structure
● Non-negative matrix factorization
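A sketch of truncated SVD with NumPy; the 6×4 word-by-document matrix is a made-up toy, and keeping k=2 singular values is an arbitrary illustrative choice:

```python
import numpy as np

# Toy word-by-document co-occurrence matrix (6 words x 4 documents).
A = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2],
              [1, 0, 1, 0],
              [0, 2, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                         # keep only the 2 largest singular values
words_2d = U[:, :k] * s[:k]   # each word reduced to a 2-dimensional vector
print(words_2d.shape)         # → (6, 2)
```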
Cons of Dimension Reduction
● Computationally very costly
● One-time operation: constructing the co-occurrence matrix and then transforming it has to be redone from scratch every time new data is encountered
● Fails to avoid the initial huge co-occurrence matrix; requires an initial pass over the entire data, which is computationally cumbersome
● No intermediary results: it is only after the co-occurrence matrix is constructed and transformed that any processing can begin
Random Indexing
Magnus Sahlgren, Swedish Institute of Computer Science, 2005
● A word space model that is inherently incremental and does not require a separate dimension reduction phase
● Each word is represented by two vectors
– Index vector : contains a randomly assigned label. The random label is a vector filled mostly with zeros, except a handful of +1s and -1s located at random positions. Index vectors are expected to be nearly orthogonal
e.g. school = [0,0,0,......,0,1,0,...........,-1,0,..............]
– Context vector : produced by scanning through the text; each time a word occurs in a context (e.g. in a document, or within a sliding context window), that context's d-dimensional index vector is added to the context vector of the word in question
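The two vectors can be sketched as follows; dimensionality 1000, 10 non-zero entries, and a ±2 window are typical illustrative values, not prescribed by the slides:

```python
import random

def index_vector(dim, nonzero, rng):
    """A sparse random label: mostly zeros, a few +1/-1 at random positions."""
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec

def build_context_vectors(sentences, dim=1000, nonzero=10, window=2, seed=0):
    """Scan the text; each occurrence of a word adds its neighbours'
    index vectors to that word's context vector."""
    rng = random.Random(seed)
    index, context = {}, {}
    for tokens in sentences:
        for w in tokens:
            if w not in index:
                index[w] = index_vector(dim, nonzero, rng)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            ctx = context.setdefault(w, [0] * dim)
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    for d, v in enumerate(index[tokens[j]]):
                        if v:
                            ctx[d] += v
    return context

ctx = build_context_vectors([["the", "quick", "brown", "fox", "jumps"]])
print(sum(abs(x) for x in ctx["fox"]))  # nonzero: "fox" absorbed 3 neighbour labels
```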
Random Indexing Example
● Sentence : "the quick brown fox jumps over the lazy dog."
● With a window-size of 2, the context vector for "fox" is calculated by adding the index vectors as below;
● N-2(quick) + N-1(brown) + N1(jumps) + N2(over); where Nk denotes the index vector permuted k steps to encode the neighbour's relative position
● Two words will have similar context vectors if the words appear in similar contexts in the text
● Finally, a document is represented by the sum of the context vectors of all words occurring in the document
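A toy version of the permuted sum, using invented 8-dimensional index vectors (real Random Indexing uses thousands of dimensions, so distinct offsets get distinct rotations):

```python
def permute(vec, k):
    """Rotate an index vector k steps; the rotation encodes the neighbour's
    relative position (Nk in the slide's notation)."""
    k %= len(vec)
    return vec[-k:] + vec[:-k]

# Invented 8-dimensional index vectors, two non-zeros each.
iv = {"quick": [1, 0, 0, -1, 0, 0, 0, 0],
      "brown": [0, 1, 0, 0, 0, -1, 0, 0],
      "jumps": [0, 0, 1, 0, 0, 0, -1, 0],
      "over":  [-1, 0, 0, 0, 1, 0, 0, 0]}

# Context vector for "fox": sum of each neighbour's index vector,
# permuted by the neighbour's offset from the focus word.
fox = [0] * 8
for word, offset in [("quick", -2), ("brown", -1), ("jumps", 1), ("over", 2)]:
    for d, v in enumerate(permute(iv[word], offset)):
        fox[d] += v
print(fox)  # → [1, -1, -1, 1, -1, 0, 2, -1]
```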
Random Indexing Parameters
● The length of the vector
– determines the dimensionality, storage requirements
● The number of nonzero (+1,-1) entries in the index vector
– has an impact on how the random distortion will be distributed over the index/context vector.
● Context window size (left and right context boundaries of a word)
● Weighting Schemes for words within context window
– Constant weighting
– Weighting factor that depends on the distance to the focus word in the middle of the context window
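Both schemes can be expressed as one function; the 2^(1-d) decay is one common choice in the word-space literature, used here purely as an illustration:

```python
def weight(offset, scheme="constant"):
    """Weight applied to a neighbour |offset| positions from the focus word."""
    if scheme == "constant":
        return 1.0
    return 2.0 ** (1 - abs(offset))  # distance decay: 1.0, 0.5, 0.25, ...

print(weight(2), weight(1, "decay"), weight(3, "decay"))  # → 1.0 1.0 0.25
```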
Data Preprocessing prior to Random Indexing
● Filtering stop words : frequent words like and, the, thus, hence contribute very little context unless one is looking at phrases
● Stemming words : reducing inflected words to their stem, base or root form. e.g. fishing, fisher, fished > fish
● Lemmatizing words : Closely related to stemming, but reduces the words to a single base or root form based on the word's context. e.g : better, good > good
● Preprocessing numbers, smilies, money : replace with <number>, <smiley>, <money> to mark that the sentence had a number/smiley at that position
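A minimal preprocessing sketch covering stop-word filtering and the <number> placeholder; the stop-word list and tokenizer are deliberately tiny stand-ins for real resources such as NLTK's:

```python
import re

STOP_WORDS = {"and", "the", "thus", "hence", "a", "is", "of"}  # tiny illustrative list

def preprocess(text):
    """Lowercase, replace digit runs with <number>, and drop stop words."""
    out = []
    for tok in re.findall(r"[a-z]+|\d+", text.lower()):
        if tok.isdigit():
            out.append("<number>")      # keep the position, hide the exact value
        elif tok not in STOP_WORDS:
            out.append(tok)
    return out

print(preprocess("The case of 24 foobars is cheap"))
# → ['case', '<number>', 'foobars', 'cheap']
```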
Random Indexing Vs LSA
● In contrast to other WSMs like LSA, which first construct the co-occurrence matrix and then extract context vectors, the Random Indexing approach reverses the process
● First, context vectors are accumulated; then a co-occurrence matrix can be constructed by collecting the context vectors as rows of the matrix
● Compresses sparse raw data to a smaller representation without a separate dimensionality reduction phase as in LSA
Random Indexing Benefits
● The dimensionality of the final context vector of a document will not depend on the number of documents or words that have been indexed
● Method is incremental
● No need to sample all texts before results can be produced, hence intermediate results can be gained
● Simple computation for context vector generation
● Doesn't require intensive processing power or memory
Random Indexing Design Concerns
● Random distortion
– Possibly non-orthogonal values in the index & context vectors
– All words will have some similarity, depending on the vector dimensionality relative to the size of the corpus loaded into the index (a small dimensionality representing a big corpus can result in random distortion)
– Have to decide what level of random distortion is acceptable to a context vector that represents a document based on the context vectors of singular words
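The distortion can be observed empirically: random sparse index vectors are only nearly orthogonal, so pairwise cosines are small but not always zero. The parameters below (1000 dimensions, 10 non-zeros, 100 vectors) are illustrative:

```python
import random

def index_vector(dim, nonzero, rng):
    """A sparse random label: a few +1/-1 entries at random positions."""
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) * sum(x * x for x in b)) ** 0.5)

rng = random.Random(42)
vecs = [index_vector(1000, 10, rng) for _ in range(100)]
sims = [abs(cosine(vecs[i], vecs[j]))
        for i in range(100) for j in range(i + 1, 100)]
print(max(sims))  # small but typically nonzero: the residual random distortion
```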
Random Indexing Design Concerns
● Negative similarity scores
● Words with no similarity would normally be expected to get a cosine similarity score of zero, but with Random Indexing they sometimes get a negative score due to opposite signs at the same index in the words' context vectors
● The effect is proportional to the size of the corpus and the dimensionality of the Random Index
Conclusion
● Random Indexing is an efficient and scalable word space model
● Can be used for text analysis applications requiring an incremental approach to analysis.
e.g: email clustering and categorizing, online forum analysis
● The optimal values for the parameters need to be predetermined to gain high accuracy: dimensionality, number of non-zero entries, and context window size
Thank you