Latent Semantic Indexing
SI650: Information Retrieval
Winter 2010
School of Information, University of Michigan
…Latent semantic indexing Singular value decomposition…
Problems with lexical semantics
• Polysemy
  – bar, bank, jaguar, hot
  – tend to reduce precision
• Synonymy
  – building/edifice, large/big, spicy/hot
  – tend to reduce recall
• Relatedness
  – doctor/patient/nurse/treatment
• Sparse matrix
• Need: dimensionality reduction
Problem in Retrieval
Query = "information retrieval"
Document 1 = "inverted index precision recall"
Document 2 = "welcome to ann arbor"
• Which one should we rank higher?
• Query vocabulary & doc vocabulary mismatch!
• Smoothing won't help here…
• If only we could represent documents/queries by topics!
Latent Semantic Indexing
• Motivation
  – Query vocabulary & doc vocabulary mismatch
  – Need to match/index based on concepts (or topics)
• Main idea:
  – Projects queries and documents into a space with "latent" semantic dimensions
  – Dimensionality reduction: the latent semantic space has fewer dimensions (semantic concepts)
  – Exploits co-occurrence: co-occurring terms are projected onto the same dimensions
Example of “Semantic Concepts”
(Slide from C. Faloutsos’s talk)
Concept Space = Dimension Reduction
• Number of concepts (K) is always smaller than the number of words (N) or the number of documents (M).
• If we represent a document as an N-dimensional vector, and the corpus as an M*N matrix…
• The goal is to reduce the dimension from N to K.
• But how can we do that?
Techniques for dimensionality reduction
• Based on matrix decomposition (goal: preserve clusters, explain away variance)
• A quick review of matrices
  – Vectors
  – Matrices
  – Matrix multiplication
(The slide shows a small worked matrix-vector multiplication example.)
Eigenvectors and eigenvalues
• An eigenvector is an implicit "direction" for a matrix: Av = λv,
  where v (eigenvector) is non-zero, though λ (eigenvalue) can be any complex number in principle
• Computing eigenvalues (det = determinant): det(A − λI) = 0
  – if A is square (N x N), has r distinct solutions, where 1 <= r <= N
• For each λ found, you can find v by (A − λI)v = 0, or Av = λv
Eigenvectors and eigenvalues
• Example: A = [−1 3; 2 0]
• det(A − λI) = (−1 − λ)(−λ) − 3*2 = 0
• Then: λ² + λ − 6 = 0; λ1 = 2; λ2 = −3
• For λ1 = 2: A − λI = [−3 3; 2 −2], so [−3 3; 2 −2] [x1; x2] = [0; 0]
• Solutions: x1 = x2
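For readers who want to check the arithmetic, here is a minimal NumPy sketch (not from the original slides; the matrix A is the reconstruction used above):

import numpy as np

# Example matrix from the slide (assumed reconstruction): A = [[-1, 3], [2, 0]]
A = np.array([[-1.0, 3.0],
              [2.0, 0.0]])

# Eigenvalues are the roots of det(A - lambda*I) = 0
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                      # the eigenvalues 2 and -3 (order may vary)

# Each column of eigvecs is an eigenvector; check A v = lambda v
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# For lambda = 2 the eigenvector satisfies x1 = x2 (up to scaling)
v = eigvecs[:, np.argmax(eigvals)]
print(v / v[0])                     # approximately [1., 1.]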
Eigenvectors and eigenvalues
• Wait, that means there are many eigenvectors for the same eigenvalue…
• v = (x1, x2)ᵀ; x1 = x2 corresponds to many vectors, e.g., (1, 1)ᵀ, (2, 2)ᵀ, (650, 650)ᵀ…
• Not surprising: if v is an eigenvector of A (Av = λv), then v' = cv is also an eigenvector (c is any non-zero constant)
Matrix Decomposition
• If A is a square (N x N) matrix and it has N linearly independent eigenvectors, it can be decomposed into U L U⁻¹,
  where U: matrix of eigenvectors (one eigenvector per column), L: diagonal matrix of eigenvalues
• AU = UL
• U⁻¹AU = L
• A = U L U⁻¹
Example
A = [2 1; 1 2], with eigenvalues 3 and 1
U = [1 1; 1 −1],  U⁻¹ = [1/2 1/2; 1/2 −1/2],  L = [3 0; 0 1]
A = U L U⁻¹ = [1 1; 1 −1] [3 0; 0 1] [1/2 1/2; 1/2 −1/2]
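A similar illustrative NumPy check (my own, not part of the deck) that the eigendecomposition A = U L U⁻¹ reconstructs A:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition A = U L U^{-1}
eigvals, U = np.linalg.eig(A)
L = np.diag(eigvals)

# Reconstruct A from its eigenvectors and eigenvalues
A_reconstructed = U @ L @ np.linalg.inv(U)
print(np.allclose(A, A_reconstructed))   # True
print(eigvals)                           # eigenvalues 3 and 1 (order may differ)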
Example
A = [3 0 0; 0 2 0; 0 0 0]; eigenvalues are 3, 2, 0
v1 = (1, 0, 0)ᵀ, v2 = (0, 1, 0)ᵀ, v3 = (0, 0, 1)ᵀ
x = (2, 4, 6)ᵀ = 2 v1 + 4 v2 + 6 v3
Ax = A (2 v1 + 4 v2 + 6 v3) = 2 A v1 + 4 A v2 + 6 A v3 = 2 λ1 v1 + 4 λ2 v2 + 6 λ3 v3
x is an arbitrary vector, yet Ax depends on the eigenvalues and eigenvectors
What about an arbitrary matrix?
• A: n x m matrix (n documents, m terms)
• A = U S Vᵀ (as opposed to A = U L U⁻¹)
• U: n x n matrix; V: m x m matrix
• S: n x m "diagonal" matrix: only values on the diagonal can be non-zero
• U Uᵀ = I; V Vᵀ = I
SVD: Singular Value Decomposition
• A = U S Vᵀ
• U is the matrix of orthogonal eigenvectors of A Aᵀ
• V is the matrix of orthogonal eigenvectors of Aᵀ A
• The diagonal entries of S (the singular values) are the square roots of the eigenvalues of Aᵀ A
• This decomposition exists for all matrices, dense or sparse
• If A has 5 columns and 3 rows, then U will be 3x3 and V will be 5x5
• In Matlab, use [U,S,V] = svd(A)
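The Matlab call above has a direct NumPy analogue; this illustrative sketch (added here, not from the slides) computes the full SVD of a random 3 x 5 matrix and checks the shapes and the reconstruction:

import numpy as np

# A toy matrix with 3 rows and 5 columns, as in the bullet above
A = np.random.rand(3, 5)

# Full SVD: A = U @ S @ Vt, with U 3x3 and V 5x5
U, s, Vt = np.linalg.svd(A, full_matrices=True)
S = np.zeros_like(A)
np.fill_diagonal(S, s)                 # place the singular values on the diagonal

print(U.shape, S.shape, Vt.shape)      # (3, 3) (3, 5) (5, 5)
print(np.allclose(A, U @ S @ Vt))      # True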
Term matrix normalization
(The slide shows a small term-document matrix A with columns D1–D5 and its normalized version A(n), in which each document (column) vector is rescaled to unit length; e.g., a column with two non-zero terms gets entries 0.71 ≈ 1/√2, one with three terms 0.58 ≈ 1/√3, and one with five terms 0.45 ≈ 1/√5.)
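A minimal sketch of this normalization step (my own illustration in NumPy): each column, i.e., each document vector, is divided by its Euclidean length.

import numpy as np

def normalize_columns(A):
    """Scale each column (document vector) of A to unit Euclidean length."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0          # leave all-zero columns untouched
    return A / norms

# Toy example: 3 terms x 2 documents
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
print(normalize_columns(A))
# columns become [0.58, 0.58, 0.58] and [0.71, 0, 0.71]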
Example (Berry and Browne)
• T1: baby  • T2: child  • T3: guide  • T4: health  • T5: home  • T6: infant  • T7: proofing  • T8: safety  • T9: toddler
• D1: infant & toddler first aid
• D2: babies & children's room (for your home)
• D3: child safety at home
• D4: your baby's health and safety: from infant to toddler
• D5: baby proofing basics
• D6: your guide to easy rust proofing
• D7: beanie babies collector's guide
Document term matrix

A =
           D1  D2  D3  D4  D5  D6  D7
baby        0   1   0   1   1   0   1
child       0   1   1   0   0   0   0
guide       0   0   0   0   0   1   1
health      0   0   0   1   0   0   0
home        0   1   1   0   0   0   0
infant      1   0   0   1   0   0   0
proofing    0   0   0   0   1   1   0
safety      0   0   1   1   0   0   0
toddler     1   0   0   1   0   0   0

A(n) (each document column normalized to unit length) =
           D1    D2    D3    D4    D5    D6    D7
baby       0     0.58  0     0.45  0.71  0     0.71
child      0     0.58  0.58  0     0     0     0
guide      0     0     0     0     0     0.71  0.71
health     0     0     0     0.45  0     0     0
home       0     0.58  0.58  0     0     0     0
infant     0.71  0     0     0.45  0     0     0
proofing   0     0     0     0     0.71  0.71  0
safety     0     0     0.58  0.45  0     0     0
toddler    0.71  0     0     0.45  0     0     0
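To connect this example to the decomposition on the next slide, here is an illustrative NumPy sketch (not part of the deck) that builds the term-document matrix, normalizes its columns, and computes the SVD; assuming A(n) is what was decomposed, the printed singular values should roughly match the S shown next.

import numpy as np

terms = ["baby", "child", "guide", "health", "home",
         "infant", "proofing", "safety", "toddler"]
docs = {
    "D1": ["infant", "toddler"],
    "D2": ["baby", "child", "home"],
    "D3": ["child", "safety", "home"],
    "D4": ["baby", "health", "safety", "infant", "toddler"],
    "D5": ["baby", "proofing"],
    "D6": ["guide", "proofing"],
    "D7": ["baby", "guide"],
}

# Binary term-document matrix (terms x documents)
A = np.array([[1.0 if t in ws else 0.0 for ws in docs.values()] for t in terms])

# Normalize each document (column) to unit length
An = A / np.linalg.norm(A, axis=0)

U, s, Vt = np.linalg.svd(An)
print(np.round(s, 4))   # singular values, largest first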
Decomposition

u =
  -0.6976  -0.0945   0.0174  -0.6950   0.0000   0.0153   0.1442  -0.0000   0
  -0.2622   0.2946   0.4693   0.1968  -0.0000  -0.2467  -0.1571  -0.6356   0.3098
  -0.3519  -0.4495  -0.1026   0.4014   0.7071  -0.0065  -0.0493  -0.0000   0.0000
  -0.1127   0.1416  -0.1478  -0.0734   0.0000   0.4842  -0.8400   0.0000  -0.0000
  -0.2622   0.2946   0.4693   0.1968   0.0000  -0.2467  -0.1571   0.6356  -0.3098
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339  -0.3098  -0.6356
  -0.3519  -0.4495  -0.1026   0.4014  -0.7071  -0.0065  -0.0493   0.0000  -0.0000
  -0.2112   0.3334   0.0962   0.2819  -0.0000   0.7338   0.4659  -0.0000   0.0000
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339   0.3098   0.6356

v =
  -0.1687   0.4192  -0.5986   0.2261   0       -0.5720   0.2433
  -0.4472   0.2255   0.4641  -0.2187   0.0000  -0.4871  -0.4987
  -0.2692   0.4206   0.5024   0.4900  -0.0000   0.2450   0.4451
  -0.3970   0.4003  -0.3923  -0.1305   0        0.6124  -0.3690
  -0.4702  -0.3037  -0.0507  -0.2607  -0.7071   0.0110   0.3407
  -0.3153  -0.5018  -0.1220   0.7128  -0.0000  -0.0162  -0.3544
  -0.4702  -0.3037  -0.0507  -0.2607   0.7071   0.0110   0.3407
Decomposition

S =
  1.5849  0       0       0       0       0       0
  0       1.2721  0       0       0       0       0
  0       0       1.1946  0       0       0       0
  0       0       0       0.7996  0       0       0
  0       0       0       0       0.7100  0       0
  0       0       0       0       0       0.5692  0
  0       0       0       0       0       0       0.1977
  0       0       0       0       0       0       0
  0       0       0       0       0       0       0

(Figure on the slide: documents spread along the v1 axis.)
What does this have to do with dimension reduction?
• Low rank matrix approximation
• SVD: A[m x n] = U[m x m] S[m x n] (V[n x n])ᵀ
• Remember that S is a diagonal matrix of singular values
• If we only keep the largest r singular values…
• A ≈ U[m x r] S[r x r] (V[n x r])ᵀ
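A hedged NumPy sketch of this truncation (my illustration, not the slides'): keep only the top r singular values and vectors and rebuild a rank-r approximation of A.

import numpy as np

def low_rank_approx(A, r):
    """Best rank-r approximation of A via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

A = np.random.rand(9, 7)
A4 = low_rank_approx(A, 4)           # rank-4 approximation
print(np.linalg.matrix_rank(A4))     # 4
print(np.linalg.norm(A - A4))        # approximation error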
Rank-4 approximation

s4 =
  1.5849  0       0       0       0  0  0
  0       1.2721  0       0       0  0  0
  0       0       1.1946  0       0  0  0
  0       0       0       0.7996  0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
Rank-4 approximation

u*s4*v' =
  -0.0019   0.5985  -0.0148   0.4552   0.7002   0.0102   0.7002
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.1980   0.0514   0.0064   0.2199   0.0535  -0.0544   0.0535
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.2165   0.2494   0.4367   0.2282  -0.0360   0.0394  -0.0360
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008
Rank-4 approximation

u*s4: word vector representation of the concepts/topics
  -1.1056  -0.1203   0.0207  -0.5558  0  0  0
  -0.4155   0.3748   0.5606   0.1573  0  0  0
  -0.5576  -0.5719  -0.1226   0.3210  0  0  0
  -0.1786   0.1801  -0.1765  -0.0587  0  0  0
  -0.4155   0.3748   0.5606   0.1573  0  0  0
  -0.2984   0.4778  -0.6015   0.1018  0  0  0
  -0.5576  -0.5719  -0.1226   0.3210  0  0  0
  -0.3348   0.4241   0.1149   0.2255  0  0  0
  -0.2984   0.4778  -0.6015   0.1018  0  0  0
Rank-4 approximation

s4*v': new (concept/topic) representation of documents
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
  -0.7150   0.5544   0.6001  -0.4686  -0.0605  -0.1457  -0.0605
   0.1808  -0.1749   0.3918  -0.1043  -0.2085   0.5700  -0.2085
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
Rank-2 approximation

s2 =
  1.5849  0       0  0  0  0  0
  0       1.2721  0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
Rank-2 approximation

u*s2*v' =
   0.1361   0.4673   0.2470   0.3908   0.5563   0.4089   0.5563
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.1057   0.1205   0.1239   0.1430   0.0293  -0.0341   0.0293
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.2343   0.2454   0.2685   0.3027   0.0286  -0.1073   0.0286
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
Rank-2 approximation

u*s2: word vector representation of the concepts/topics
  -1.1056  -0.1203  0  0  0  0  0
  -0.4155   0.3748  0  0  0  0  0
  -0.5576  -0.5719  0  0  0  0  0
  -0.1786   0.1801  0  0  0  0  0
  -0.4155   0.3748  0  0  0  0  0
  -0.2984   0.4778  0  0  0  0  0
  -0.5576  -0.5719  0  0  0  0  0
  -0.3348   0.4241  0  0  0  0  0
  -0.2984   0.4778  0  0  0  0  0
Rank-2 approximation

s2*v': new (concept/topic) representation of documents
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
Latent Semantic Indexing
A[n x m] ≈ U[n x r] L[r x r] (V[m x r])ᵀ
• A: n x m matrix (n documents, m terms)
• U: n x r matrix (n documents, r concepts)
• L: r x r diagonal matrix (strength of each 'concept'); r is the rank of the matrix
• V: m x r matrix (m terms, r concepts)
Latent semantic indexing (LSI)
• Dimensionality reduction = identification of hidden (latent) concepts
• Query matching in latent space
• LSI matches documents even if they don't have words in common, as long as they share frequently co-occurring terms
Back to the CS-MED example
(Slide from C. Faloutsos’s talk)
Example of LSI

A = U L Vᵀ (dimension reduction)

A (rows: CS documents then MD documents; columns: data, inf, retrieval, brain, lung):
  1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1

U (columns: CS-concept, MD-concept):
  0.18  0
  0.36  0
  0.18  0
  0.90  0
  0     0.53
  0     0.80
  0     0.27

L (strength of each concept, e.g., 9.64 = strength of the CS-concept):
  9.64  0
  0     5.29

Vᵀ (term representation of the concepts):
  0.58  0.58  0.58  0     0
  0     0     0     0.71  0.71

(Slide adapted from C. Faloutsos's talk)
How to Map Query/Doc to the Same Concept Space?

qᵀconcept = qᵀ V
dᵀconcept = dᵀ V

qᵀ (terms: data, inf., retrieval, brain, lung) = [1 0 0 0 0]

qᵀ V = [1 0 0 0 0] [ 0.58  0
                     0.58  0
                     0.58  0
                     0     0.71
                     0     0.71 ]  =  [0.58  0]   (similarity with the CS-concept)

dᵀ = [0 1 1 0 0],  dᵀ V = [1.16  0]

(Slide adapted from C. Faloutsos's talk)
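An illustrative NumPy sketch of this mapping (my own; it reuses the toy V reconstructed above): project the query and the document with V, then compare them in concept space.

import numpy as np

# Term-by-concept matrix V from the toy example (terms: data, inf., retrieval, brain, lung)
V = np.array([[0.58, 0.00],
              [0.58, 0.00],
              [0.58, 0.00],
              [0.00, 0.71],
              [0.00, 0.71]])

q = np.array([1, 0, 0, 0, 0])   # query containing only "data"
d = np.array([0, 1, 1, 0, 0])   # document containing "inf." and "retrieval"

q_concept = q @ V               # [0.58, 0.  ]
d_concept = d @ V               # [1.16, 0.  ]

cosine = q_concept @ d_concept / (np.linalg.norm(q_concept) * np.linalg.norm(d_concept))
print(q_concept, d_concept, cosine)   # cosine = 1.0

Note that the query and the document share no terms, yet their concept-space vectors are identical up to scale, which is exactly the LSI matching behavior described earlier.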
Useful pointers
• http://lsa.colorado.edu
• http://lsi.research.telcordia.com
• http://www.cs.utk.edu/~lsi
Readings
• MRS18
• MRS17, MRS19
• MRS20
Problem of LSI
• Concepts/topics are hard to interpret
• New document/query vectors could have negative values
• Lack of statistical interpretation
• Probabilistic latent semantic indexing…
General Idea of Probabilistic Topic Models
• Modeling a topic/subtopic/theme with a multinomial distribution (unigram LM)
• Modeling text data with a mixture model involving multinomial distributions
  – A document is "generated" by sampling words from some multinomial distribution
  – Each time, a word may be generated from a different distribution
  – Many variations of how these multinomial distributions are mixed
• Topic mining = fitting the probabilistic model to text
• Answer topic-related questions by computing various kinds of conditional probabilities based on the estimated model (e.g., p(time | topic), p(time | topic, location))
Document as a Sample of Mixed Topics
• Applications of topic models:
  – Summarize themes/aspects
  – Facilitate navigation/browsing
  – Retrieve documents
  – Segment documents
  – Many others
• How can we discover these topic word distributions?

Example topics (word distributions) and a sample document from the slide:
  Topic 1: government 0.3, response 0.2, …
  Topic 2: city 0.2, new 0.1, orleans 0.05, …
  Topic k: donate 0.1, relief 0.05, help 0.02, …
  Background B: is 0.05, the 0.04, a 0.03, …

"[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …"
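To make "a document is a sample of mixed topics" concrete, here is a small illustrative Python sketch (mine; the mixing weights and random seed are made up, and only the top words shown on the slide are used) that generates words from such a mixture:

import random

# Top words of each topic, as listed on the slide (weights need not sum to 1;
# random.choices normalizes them)
topics = {
    "topic_1":    {"government": 0.3, "response": 0.2},
    "topic_2":    {"city": 0.2, "new": 0.1, "orleans": 0.05},
    "topic_k":    {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "background": {"is": 0.05, "the": 0.04, "a": 0.03},
}

# Hypothetical document-specific topic weights p(topic | doc), chosen for illustration
mixing = {"topic_1": 0.35, "topic_2": 0.25, "topic_k": 0.15, "background": 0.25}

rng = random.Random(0)

def generate_word():
    # Pick a topic according to the mixing weights, then a word from that topic
    topic = rng.choices(list(mixing), weights=list(mixing.values()))[0]
    dist = topics[topic]
    return rng.choices(list(dist), weights=list(dist.values()))[0]

print(" ".join(generate_word() for _ in range(15)))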
Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
• Mix k multinomial distributions to generate a document
• Each document has a potentially different set of mixing weights, which captures the topic coverage
• When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)
• We may add a background distribution to "attract" background words
PLSI (a.k.a. Aspect Model)
• Every document is a mixture of underlying (latent) K aspects (topics) with mixture weights p(z|d)
  – How is this related to LSI?
• Each aspect is represented by a distribution of words p(w|z)
• Estimate p(z|d) and p(w|z) using the EM algorithm

p(d, w) = p(d) p(w|d)
p(w|d) = Σ_{z∈Z} p(z|d) p(w|z)
PLSI as a Mixture Model

"Generating" word w in doc d in the collection:
• With probability λB, w is drawn from the background distribution B (e.g., is 0.05, the 0.04, a 0.03, …)
• With probability 1 − λB, a topic z is picked according to the document-specific weights p(z1|d), p(z2|d), …, p(zk|d), and w is drawn from that topic's word distribution p(w|z), e.g.:
  – Topic z1: warning 0.3, system 0.2, …
  – Topic z2: statistics 0.2, loss 0.1, dead 0.05, …
  – Topic zk: aid 0.1, donation 0.05, support 0.02, …

p(d, w) = p(d) p(w|d)
p(w|d) = Σ_{z∈Z} p(z|d) p(w|z)

Parameters: λB = noise level (manually set); p(z|d) and p(w|z) are estimated with Maximum Likelihood
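A hedged Python sketch (my own notation and toy numbers) of computing p(w|d) under this mixture with an optional background component:

# Toy PLSI-style mixture: p(w|d) = lambda_B * p(w|B) + (1 - lambda_B) * sum_z p(z|d) p(w|z)
# All numbers below are made up for illustration.
p_w_given_z = {
    "z1": {"warning": 0.3, "system": 0.2, "failure": 0.5},
    "z2": {"aid": 0.1, "donation": 0.05, "support": 0.85},
}
p_z_given_d = {"z1": 0.6, "z2": 0.4}             # mixing weights for one document d
p_w_given_B = {"the": 0.5, "is": 0.3, "a": 0.2}  # background distribution
lambda_B = 0.1                                   # noise level, set manually

def p_word_given_doc(w):
    topic_part = sum(p_z_given_d[z] * p_w_given_z[z].get(w, 0.0) for z in p_z_given_d)
    background_part = p_w_given_B.get(w, 0.0)
    return lambda_B * background_part + (1 - lambda_B) * topic_part

print(p_word_given_doc("warning"))   # 0.9 * (0.6 * 0.3) = 0.162
print(p_word_given_doc("the"))       # 0.1 * 0.5 = 0.05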
Parameter Estimation using EM Algorithm
• We have the equation for the log-likelihood function from the PLSI model, which we want to maximize:

L = Σ_{d∈D} Σ_{w∈W} c(w, d) log p(d, w)

• Maximizing likelihood using Expectation Maximization
EM Steps
• E-Step
  – Expectation step, where the expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
E Step
• It is the probability that a word w occurring in a document d is explained by topic z:

p(z|d, w) = p(z|d) p(w|z) / Σ_{z'∈Z} p(z'|d) p(w|z')
M Step
• All these equations use p(z|d, w) calculated in the E Step:

p(w|z) = Σ_{d∈D} c(w, d) p(z|d, w) / ( Σ_{w'∈W} Σ_{d∈D} c(w', d) p(z|d, w') )

p(z|d) = Σ_{w∈W} c(w, d) p(z|d, w) / ( Σ_{z'∈Z} Σ_{w∈W} c(w, d) p(z'|d, w) )

• Converges to a local maximum of the likelihood function
• We will see more when we talk about topic modeling
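Putting the E-step and M-step together, here is a compact, illustrative NumPy implementation of these PLSI updates (a sketch under the simplifying assumptions of no background distribution and a dense count matrix; variable names are mine):

import numpy as np

def plsi_em(counts, k, n_iter=100, seed=0):
    """counts: (n_docs, n_words) term-count matrix c(w, d); k: number of topics."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random initialization of p(z|d) and p(w|z), rows normalized to sum to 1
    p_z_d = rng.random((n_docs, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: p(z|d,w) proportional to p(z|d) p(w|z), for every (d, w)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (n_docs, k, n_words)
        p_z_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)

        # M-step: re-estimate p(w|z) and p(z|d) from the expected counts c(w,d) p(z|d,w)
        expected = counts[:, None, :] * p_z_dw               # shape (n_docs, k, n_words)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_d, p_w_z

Each iteration first computes p(z|d, w) for every document-word pair (E-step) and then renormalizes the expected counts into p(w|z) and p(z|d) (M-step), mirroring the two formulas above.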
Example of PLSI
Topics represented as word distributions
• Topics are interpretable!
- Example of topics found from blog articles about “Hurricane Katrina”