Probabilistic Topic Models for Text Mining

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
(Slide from Rebecca Hwa’s “Intro to Text Mining”)
Two Different Views of Text Mining
• Data Mining View: Explore patterns in textual data (shallow mining)
  – Find latent topics
  – Find topical trends
  – Find outliers and other hidden patterns
• Natural Language Processing View: Make inferences based on partial understanding of natural language text (deep mining)
  – Information extraction
  – Question answering
Applications of Text Mining
• Direct applications: Go beyond search to find knowledge
– Question-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer them?
– Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?
• Indirect applications
– Assist information access (e.g., discover latent topics to better summarize search results)
– Assist information organization (e.g., discover hidden structures)
Text Mining Methods
• Data Mining Style: View text as high-dimensional data
  – Frequent pattern finding
  – Association analysis
  – Outlier detection
• Information Retrieval Style: Fine-granularity topical analysis
  – Topic extraction
  – Exploit term weighting and text similarity measures
• Natural Language Processing Style: Information extraction
  – Entity extraction
  – Relation extraction
  – Sentiment analysis
  – Question answering
• Machine Learning Style: Unsupervised or semi-supervised learning (the topic of this lecture)
  – Mixture models
  – Dimension reduction
Outline
• The Basic Topic Models:
  – Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
  – Latent Dirichlet Allocation (LDA) [Blei et al. 02]
• Extensions:
  – Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]
Basic Topic Model: PLSA
PLSA: Motivation
What did people say in their blog articles about "Hurricane Katrina"?

Query = "Hurricane Katrina"

Results:

Government Response: bush 0.071, president 0.061, federal 0.051, government 0.047, fema 0.047, administrate 0.023, response 0.020, brown 0.019, blame 0.017, governor 0.014
New Orleans: city 0.063, orleans 0.054, new 0.034, louisiana 0.023, flood 0.022, evacuate 0.021, storm 0.017, resident 0.016, center 0.016, rescue 0.012
Oil Price: price 0.077, oil 0.064, gas 0.045, increase 0.020, product 0.020, fuel 0.018, company 0.018, energy 0.017, market 0.016, gasoline 0.012
Praying and Blessing: god 0.141, pray 0.047, prayer 0.041, love 0.030, life 0.025, bless 0.025, lord 0.017, jesus 0.016, will 0.013, faith 0.012
Aid and Donation: donate 0.120, relief 0.076, red 0.070, cross 0.065, help 0.050, victim 0.036, organize 0.022, effort 0.020, fund 0.019, volunteer 0.019
Personal: i 0.405, my 0.116, me 0.060, am 0.029, think 0.015, feel 0.012, know 0.011, something 0.007, guess 0.007, myself 0.006
Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
• Mix k multinomial distributions to generate a document
• Each document has a potentially different set of mixing weights which captures the topic coverage
• When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)
• We may add a background distribution to “attract” background words
PLSA as a Mixture Model
[Figure: mixture model for "generating" word w in doc d in the collection. Each of the k topics is a word distribution, e.g. Topic 1: warning 0.3, system 0.2, …; Topic 2: aid 0.1, donation 0.05, support 0.02, …; Topic k: statistics 0.2, loss 0.1, dead 0.05, …; Background B: is 0.05, the 0.04, a 0.03, … Document d mixes topic j with weight π_{d,j} and the background with weight λ_B.]

\[ p_d(w) = \lambda_B\, p(w\mid\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w\mid\theta_j) \]

\[ \log p(d) = \sum_{w\in V} c(w,d)\, \log\!\left[ \lambda_B\, p(w\mid\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w\mid\theta_j) \right] \]

Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood.
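To make the two formulas concrete, here is a minimal NumPy sketch of the per-document mixture and its log-likelihood (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def plsa_log_likelihood(counts, p_w_B, p_w_topics, pi_d, lambda_B):
    """log p(d) for one document under the PLSA mixture with background.

    counts     : (V,) word counts c(w, d) for document d
    p_w_B      : (V,) background distribution p(w | theta_B)
    p_w_topics : (k, V) topic word distributions p(w | theta_j)
    pi_d       : (k,) document-specific mixing weights pi_{d,j}
    lambda_B   : background noise level lambda_B (manually set)
    """
    # p_d(w) = lambda_B p(w|B) + (1 - lambda_B) sum_j pi_{d,j} p(w|theta_j)
    p_w_d = lambda_B * p_w_B + (1.0 - lambda_B) * (pi_d @ p_w_topics)
    return float(np.sum(counts * np.log(p_w_d)))
```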
How to Estimate θ_j: the EM Algorithm

[Figure: the known background model p(w|θ_B) (the 0.2, a 0.1, we 0.01, to 0.02, …) and two unknown topic models — p(w|θ_1) = ? for "Text mining" (text = ?, mining = ?, association = ?, word = ?, …) and p(w|θ_2) = ? for "information retrieval" (information = ?, retrieval = ?, query = ?, document = ?, …) — are fit to the observed doc(s) by the ML estimator.]

Suppose we knew the identity of each word … then estimation would reduce to counting; EM fills in exactly this missing information probabilistically.
How the Algorithm Works

[Figure: a toy example with two documents d1 and d2 over the vocabulary {aid, price, oil} with counts c(w, d), two topics with word distributions P(w|θ_1) and P(w|θ_2), and document-topic weights π_{d1,1} = P(θ_1|d1), π_{d1,2} = P(θ_2|d1), π_{d2,1} = P(θ_1|d2), π_{d2,2} = P(θ_2|d2), all starting from initial values.]

Initialize π_{d,j} and P(w|θ_j) with random values, then iterate:
– Iteration 1, E-step: split the word counts among topics (by computing the z's), i.e., compute c(w,d) p(z_{d,w} = B) and c(w,d)(1 − p(z_{d,w} = B)) p(z_{d,w} = j)
– Iteration 1, M-step: re-estimate π_{d,j} and P(w|θ_j) by adding and normalizing the split word counts
– Iteration 2: repeat the E-step and M-step
– Iterations 3, 4, 5, … until convergence
Parameter Estimation

E-step: the probability that word w in doc d is generated from cluster j, or from the background (an application of Bayes' rule):

\[ p(z_{d,w}=j) = \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w\mid\theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w\mid\theta_{j'})} \]

\[ p(z_{d,w}=B) = \frac{\lambda_B\, p(w\mid\theta_B)}{\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}^{(n)}\, p^{(n)}(w\mid\theta_j)} \]

M-step: re-estimate the mixing weights and the cluster language model from the fractional counts contributed by each word (to using cluster j in generating d, and to generating w from cluster j):

\[ \pi_{d,j}^{(n+1)} = \frac{\sum_{w\in V} c(w,d)\,(1-p(z_{d,w}=B))\,p(z_{d,w}=j)}{\sum_{j'}\sum_{w\in V} c(w,d)\,(1-p(z_{d,w}=B))\,p(z_{d,w}=j')} \]

\[ p^{(n+1)}(w\mid\theta_j) = \frac{\sum_{i=1}^{m}\sum_{d\in C_i} c(w,d)\,(1-p(z_{d,w}=B))\,p(z_{d,w}=j)}{\sum_{w'\in V}\sum_{i=1}^{m}\sum_{d\in C_i} c(w',d)\,(1-p(z_{d,w'}=B))\,p(z_{d,w'}=j)} \]

The M-step sums run over all docs (in multiple collections); m = 1 if there is one collection.
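The E-step and M-step above translate almost line for line into code. Below is a hedged sketch in NumPy for a single collection (m = 1); the names, random initialization, iteration count, and default noise level are all illustrative choices, not the author's implementation:

```python
import numpy as np

def plsa_em(C, p_w_B, k, lambda_B=0.9, n_iter=50, seed=0):
    """EM for PLSA with a fixed background model.

    C      : (D, V) document-term count matrix c(w, d)
    p_w_B  : (V,) background distribution p(w | theta_B)
    k      : number of topics
    Returns pi (D, k) doc-topic weights and theta (k, V) topic distributions.
    """
    rng = np.random.default_rng(seed)
    D, V = C.shape
    pi = rng.random((D, k));    pi /= pi.sum(axis=1, keepdims=True)
    theta = rng.random((k, V)); theta /= theta.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posteriors of the hidden word labels z_{d,w}
        mix = pi @ theta                         # (D, V) = sum_j pi_{d,j} p(w|theta_j)
        p_B = lambda_B * p_w_B / (lambda_B * p_w_B + (1 - lambda_B) * mix)
        p_j = pi[:, :, None] * theta[None, :, :]           # (D, k, V)
        p_j /= p_j.sum(axis=1, keepdims=True)              # p(z_{d,w} = j)
        # M-step: add and normalize the fractional counts
        frac = (C * (1 - p_B))[:, None, :] * p_j           # c(w,d)(1-p(z=B))p(z=j)
        pi = frac.sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True)
        theta = frac.sum(axis=0)
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta
```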
PLSA with Prior Knowledge
• There are different ways of choosing aspects (topics)
  – Google = Google News + Google Map + Google Scholar, …
  – Google = Google US + Google France + Google China, …
• Users have some domain knowledge in mind, e.g.,
  – We expect to see "retrieval models" as a topic in IR.
  – We want to see the aspects "history" and "statistics" for YouTube.
• A flexible way to incorporate such knowledge is as priors of the PLSA model
• In Bayesian terms, the prior encodes your "belief" about the topic distributions
Adding Prior
[Figure: the same mixture model as before — topics θ_1 (warning 0.3, system 0.2, …), θ_2 (aid 0.1, donation 0.05, support 0.02, …), …, θ_k (statistics 0.2, loss 0.1, dead 0.05, …) and background θ_B (is 0.05, the 0.04, a 0.03, …), with weights π_{d,j} and λ_B, "generating" word w in doc d in the collection.]

Parameters: λ_B = noise level (manually set); the θ's and π's are now estimated with their most likely a posteriori values:

\[ \theta^{*} = \arg\max_{\theta}\, p(\theta)\, p(\text{Data}\mid\theta) \]
Adding Prior as Pseudo Counts
16
the 0.2a 0.1we 0.01to 0.02…
KnownBackground
p(w | B)
…text =? mining =? association =?word =? …
Unknowntopic model
p(w|1)=?
“Text mining”
…information =? retrieval =? query =?document =? …
…Unknown
topic modelp(w|2)=?
“informationretrieval”
Suppose, we knowthe identity of each word ...
Observed Doc(s)
MAPEstimator
Pseudo Doc
Size = μtext
mining
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 1717
Maximum A Posterior (MAP) Estimation
The E-step and the update for π are unchanged; the M-step for the topic word distributions simply adds the pseudo counts of w from the prior θ'_j before normalizing:

\[ p^{(n+1)}(w\mid\theta_j) = \frac{\sum_{i=1}^{m}\sum_{d\in C_i} c(w,d)\,(1-p(z_{d,w}=B))\,p(z_{d,w}=j) \;+\; \mu\, p(w\mid\theta'_j)}{\sum_{w'\in V}\sum_{i=1}^{m}\sum_{d\in C_i} c(w',d)\,(1-p(z_{d,w'}=B))\,p(z_{d,w'}=j) \;+\; \mu} \]

Here μ is the sum of all pseudo counts. What if μ = 0? (We recover the Maximum Likelihood estimate.) What if μ = +∞? (The topic is fixed to the prior.)
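In code, only the M-step for θ changes relative to the ML sketch above: add μ·p(w|θ'_j) pseudo counts to the fractional counts before normalizing. A minimal sketch, assuming the prior rows sum to one (names illustrative):

```python
import numpy as np

def map_m_step_theta(frac, theta_prior, mu):
    """MAP re-estimate of p(w | theta_j) with conjugate pseudo counts.

    frac        : (D, k, V) fractional counts c(w,d)(1-p(z=B))p(z=j) from the E-step
    theta_prior : (k, V) prior word distributions p(w | theta'_j), rows summing to 1
    mu          : prior strength = total pseudo counts added per topic
    """
    # mu = 0 recovers the ML estimate; mu -> infinity pins theta_j to the prior.
    theta = frac.sum(axis=0) + mu * theta_prior
    return theta / theta.sum(axis=1, keepdims=True)
```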
Basic Topic Model: LDA
The following slides about LDA are taken from Michael C. Mozer's course lecture: http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/
LDA: Motivation (limitations of pLSI)
– "Documents have no generative probabilistic semantics"
  • i.e., a document is just a symbol (an index), not something the model can generate
– The model has many parameters
  • linear in the number of documents
  • heuristic methods are needed to prevent overfitting
– Cannot generalize to new documents
Unigram Model

\[ p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n) \]
Mixture of Unigrams

\[ p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z) \]
Topic Model / Probabilistic LSI

\[ p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d) \]

• d is a localist representation of (trained) documents
• LDA provides a distributed representation
LDA
• Vocabulary of |V| words
• A document is a collection of N words from the vocabulary: w = (w_1, ..., w_N)
• Latent topics: a random variable z with values 1, ..., k
• Like the topic model (pLSI), a document is generated by sampling a topic from a mixture and then sampling a word from that topic.
• But the topic model assumes a fixed mixture of topics (a multinomial distribution) for each document.
• LDA instead assumes a random mixture of topics (drawn from a Dirichlet distribution) for each document.
Generative Model
• "Plates" indicate looping structure
  – The outer plate is replicated for each document
  – The inner plate is replicated for each word
  – The same conditional distributions apply for each replicate
• Document probability:

\[ p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\right) d\theta \]
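The two plates correspond directly to the two loops in a simulation of the generative process; a toy sketch (k, V, N, α, and β below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
k, V, N = 3, 10, 20                       # topics, vocabulary size, words per doc
alpha = np.full(k, 0.1)                   # Dirichlet hyperparameter
beta = rng.dirichlet(np.ones(V), size=k)  # (k, V) topic-word distributions

def generate_document():
    theta = rng.dirichlet(alpha)          # outer plate: per-document topic mixture
    words = []
    for _ in range(N):                    # inner plate: one topic and word per token
        z = rng.choice(k, p=theta)
        words.append(rng.choice(V, p=beta[z]))
    return words

print(generate_document())
```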
Fancier Version
\[ p(\theta\mid\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\; \theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1} \]
Inference
\[ p(\theta,\mathbf{z}\mid\mathbf{w},\alpha,\beta) = \frac{p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta)}{p(\mathbf{w}\mid\alpha,\beta)} \]

\[ p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta) = p(\theta\mid\alpha)\prod_{n=1}^{N} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta) \]

\[ p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\right) d\theta \]
Inference
• In general, the normalizer p(w | α, β) is intractable.
• Expanded version:

\[ p(\mathbf{w}\mid\alpha,\beta) = \frac{\Gamma\!\left(\sum_i\alpha_i\right)}{\prod_i\Gamma(\alpha_i)} \int \left(\prod_{i=1}^{k}\theta_i^{\alpha_i-1}\right) \left(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_i\beta_{ij})^{w_n^j}\right) d\theta \]

where w_n^j = 1 if w_n is the j-th vocabulary word (and 0 otherwise).
Variational Approximation
• Compute the log likelihood and apply Jensen's inequality, log E[x] ≥ E[log x], to get a tractable lower bound
• Find a variational distribution q that makes the bound computable
  – q is parameterized by γ and φ_n
  – Maximize the bound with respect to γ and φ_n to obtain the best approximation to p(w | α, β)
  – This leads to a variational EM algorithm
• Sampling algorithms (e.g., Gibbs sampling) are also common
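For completeness, here is a hedged sketch of collapsed Gibbs sampling for LDA with symmetric priors (this algorithm is not spelled out on the slides; all names and hyperparameter values are illustrative):

```python
import numpy as np

def lda_gibbs(docs, k, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word ids. Returns doc-topic and topic-word counts."""
    rng = np.random.default_rng(seed)
    z = [[int(rng.integers(k)) for _ in doc] for doc in docs]  # random init
    ndk = np.zeros((len(docs), k)); nkw = np.zeros((k, V)); nk = np.zeros(k)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]; ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove the current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(z_i = t | rest) ∝ (n_dk + alpha)(n_kw + eta)/(n_k + V*eta)
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                t = int(rng.choice(k, p=p / p.sum()))
                z[d][i] = t; ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw
```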
Data Sets
• C. Elegans Community abstracts: 5,225 abstracts, 28,414 unique terms
• TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms
• Held-out data: 10%
• Removed terms: 50 stop words and words appearing once
C. Elegans
Note: a "fold-in" hack lets pLSI handle novel documents; it involves refitting the p(z|d_new) parameters, which is sort of a cheat.
AP
Summary: PLSA vs. LDA
• LDA adds a Dirichlet distribution on top of PLSA to regularize the model
• Estimation of LDA is more complicated than PLSA
• LDA is a generative model, while PLSA isn’t
• PLSA is more likely to over-fit the data than LDA
• Which one to use?
– If you need generalization capacity, LDA
– If you want to mine topics from a collection, PLSA may be better (we want overfitting!)
Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)
A General Introduction to EM
Data: X (observed) + H (hidden); parameter: θ

"Incomplete" likelihood: L(θ) = log p(X|θ)
"Complete" likelihood: L_c(θ) = log p(X, H|θ)

EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess θ^{(0)}:

1. E-step: compute the expectation of the complete likelihood

\[ Q(\theta;\theta^{(n-1)}) = E\!\left[L_c(\theta)\mid X,\theta^{(n-1)}\right] = \sum_{h} p(H=h\mid X,\theta^{(n-1)})\,\log P(X,h\mid\theta) \]

2. M-step: compute θ^{(n)} by maximizing the Q-function

\[ \theta^{(n)} = \arg\max_{\theta} Q(\theta;\theta^{(n-1)}) = \arg\max_{\theta}\sum_{h} p(H=h\mid X,\theta^{(n-1)})\,\log P(X,h\mid\theta) \]
Convergence Guarantee
Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ); i.e., choose θ^{(n)} so that L(θ^{(n)}) − L(θ^{(n−1)}) ≥ 0.

Note that, since p(X,H|θ) = p(H|X,θ) P(X|θ), we have L(θ) = L_c(θ) − log p(H|X,θ) (and L(θ) doesn't contain H), so

\[ L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log\frac{p(H\mid X,\theta^{(n-1)})}{p(H\mid X,\theta^{(n)})} \]

Taking the expectation w.r.t. p(H|X, θ^{(n−1)}),

\[ L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)};\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\!\left(p(H\mid X,\theta^{(n-1)})\,\middle\|\,p(H\mid X,\theta^{(n)})\right) \]

The KL-divergence is always non-negative, and EM chooses θ^{(n)} to maximize Q; therefore L(θ^{(n)}) ≥ L(θ^{(n−1)})!
Another way of looking at EM
[Figure: the likelihood curve p(X|θ) with the current guess θ^{(n−1)}, the lower bound (the Q function), and the next guess θ^{(n)} at the maximum of the bound.]

\[ L(\theta) = L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\!\left(p(H\mid X,\theta^{(n-1)})\,\middle\|\,p(H\mid X,\theta)\right) \]

\[ L(\theta) \ge L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) \]

E-step = computing the lower bound; M-step = maximizing the lower bound.
Why Contextual PLSA?
Motivating Example: Comparing Product Reviews

Common Themes   | "IBM" specific    | "APPLE" specific   | "DELL" specific
Battery Life    | Long, 4-3 hrs     | Medium, 3-2 hrs    | Short, 2-1 hrs
Hard disk       | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB
Speed           | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz

Input corpora: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews.
Unsupervised discovery of common topics and their variations
Motivating Example: Comparing News about Similar Topics

Common Themes   | "Vietnam" specific | "Afghan" specific | "Iraq" specific
United nations  | …                  | …                 | …
Death of people | …                  | …                 | …
…               | …                  | …                 | …

Input corpora: Vietnam War, Afghan War, and Iraq War news articles.
Unsupervised discovery of common topics and their variations
Motivating Example: Discovering Topical Trends in Literature

Unsupervised discovery of topics and their temporal variations

[Figure: theme strength over time (1980-2003) for themes such as TF-IDF retrieval, IR applications, language models, and text categorization.]
Motivating Example: Analyzing Spatial Topic Patterns

• How do blog writers in different states respond to topics such as "oil price increase during Hurricane Katrina"?
• Unsupervised discovery of topics and their variations in different locations
Motivating Example: Sentiment Summary
Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics
Research Questions
• Can we model all these problems generally?
• Can we solve these problems with a unified approach?
• How can we bring humans into the loop?
Contextual Text Mining
• Given collections of text with contextual information (meta-data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
  – Summarizing search results
  – Federation of text information
  – Opinion analysis
  – Social network analysis
  – Business intelligence
  – …
Context Features of Text (Meta-data)
[Figure: a weblog article annotated with its context features: author, author's occupation, location, time, communities, source.]
Context = Partitioning of Text
[Figure: a text collection partitioned in different ways — by time (1998, 1999, …, 2005, 2006; e.g., papers written in 1998), by venue (WWW, SIGIR, ACL, KDD, SIGMOD), by author location (e.g., papers written by authors in the US), or by topic (e.g., papers about the Web).]
Themes/Topics
• Uses of themes:
  – Summarize topics/subtopics
  – Navigate in a document space
  – Retrieve documents
  – Segment documents
  – …

[Figure: theme word distributions — Theme 1: government 0.3, response 0.2, …; Theme 2: donate 0.1, relief 0.05, help 0.02, …; Theme k: city 0.2, new 0.1, orleans 0.05, …; Background B: is 0.05, the 0.04, a 0.03, …]
[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …
View of Themes: Context-Specific Version of Views
[Figure: context-specific views of two themes. Theme 1 (Retrieval Model) — core words: retrieve, model, relevance, document, query. In the context before 1998 (traditional models) it appears as: vector, space, TF-IDF, Okapi, LSI, Rocchio, weighting, term, retrieval; in the context after 1998 (language models) as: language, model, smoothing, query, generation, mixture, estimate, EM, pseudo. Theme 2 (Feedback): feedback, judge, expansion, pseudo, query.]
Coverage of Themes: Distribution over Themes
• Theme coverage can depend on context

[Figure: an example document ("Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …") with its coverage of Background, Oil Price, Government Response, and Aid and Donation shown under two contexts: Texas and Louisiana.]
General Tasks of Contextual Text Mining
• Theme Extraction: Extract the global salient themes
– Common information shared over all contexts
• View Comparison: Compare a theme from different views
– Analyze the content variation of themes over contexts
• Coverage Comparison: Compare the theme coverage of different contexts
– Reveal how closely a theme is associated to a context
• Others:
– Causal analysis
– Correlation analysis
A General Solution: CPLSA
• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of the PLSA model ([Hofmann 99]) by
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics
• Process of contextual text mining
– Instantiation of CPLSA (context, views, coverage)
– Fit the model to text data (EM algorithm)
– Compute probabilistic topic patterns
"Generation" Process of CPLSA

[Figure: a document with context (Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+, …) is generated word by word: choose a view (View1/View2/View3, e.g., Texas, July 2005, sociologist); choose a coverage (e.g., the Texas coverage, the July 2005 coverage, or a document-specific coverage); choose a theme (government: government 0.3, response 0.2, …; donation: donate 0.1, relief 0.05, help 0.02, …; New Orleans: city 0.2, new 0.1, orleans 0.05, …); then draw a word from θ_i (e.g., government, response; donate, aid, help; new, Orleans). Example document: "Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …"]
Probabilistic Model

• To generate a document D with context feature set C:
  – Choose a view v_i according to the view distribution p(v_i | D, C)
  – Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
  – Choose a theme θ_l according to the coverage κ_j, i.e., with probability p(θ_l | κ_j)
  – Generate a word using θ_l (as seen under view v_i)
• The likelihood of the document collection is:

\[ \log p(\mathcal{C}) = \sum_{(D,C)\in\mathcal{C}} \sum_{w\in V} c(w,D)\, \log\!\left( \sum_{i=1}^{n} p(v_i\mid D,C) \sum_{j=1}^{m} p(\kappa_j\mid D,C) \sum_{l=1}^{k} p(\theta_l\mid\kappa_j)\, p(w\mid\theta_l, v_i) \right) \]
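The log-likelihood transcribes directly into code for one document; a minimal sketch (names illustrative, not the authors' implementation):

```python
import numpy as np

def cplsa_log_likelihood(c_wD, p_view, p_cov, p_theme_cov, p_w_theme_view):
    """Log-likelihood contribution of one document D with context C.

    c_wD           : (V,)      word counts c(w, D)
    p_view         : (n,)      p(v_i | D, C)
    p_cov          : (m,)      p(kappa_j | D, C)
    p_theme_cov    : (m, k)    p(theta_l | kappa_j)
    p_w_theme_view : (n, k, V) p(w | theta_l, v_i)
    """
    theme_mix = p_cov @ p_theme_cov  # (k,) = sum_j p(kappa_j|D,C) p(theta_l|kappa_j)
    # p(w) = sum_i p(v_i|D,C) sum_l theme_mix[l] p(w | theta_l, v_i)
    p_w = np.einsum('i,l,ilv->v', p_view, theme_mix, p_w_theme_view)
    return float(np.sum(c_wD * np.log(p_w)))
```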
Parameter Estimation: EM Algorithm
• Interesting patterns:
  – Theme content variation for each view: p(w | θ_l, v_i)
  – Theme strength variation for each context: p(θ_l | κ_j)
• A prior from a user can be incorporated using MAP estimation

E-step (posterior probability that word w in document D was generated using view v_i, coverage κ_j, and theme θ_l):

\[ p(z_{D,w}=(i,j,l)) = \frac{p^{(t)}(v_i\mid D,C)\, p^{(t)}(\kappa_j\mid D,C)\, p^{(t)}(\theta_l\mid\kappa_j)\, p^{(t)}(w\mid\theta_l,v_i)}{\sum_{i'=1}^{n}\sum_{j'=1}^{m}\sum_{l'=1}^{k} p^{(t)}(v_{i'}\mid D,C)\, p^{(t)}(\kappa_{j'}\mid D,C)\, p^{(t)}(\theta_{l'}\mid\kappa_{j'})\, p^{(t)}(w\mid\theta_{l'},v_{i'})} \]

M-step (normalize the fractional counts over the appropriate index):

\[ p^{(t+1)}(v_i\mid D,C) \propto \sum_{w\in V} c(w,D) \sum_{j=1}^{m}\sum_{l=1}^{k} p(z_{D,w}=(i,j,l)) \]

\[ p^{(t+1)}(\kappa_j\mid D,C) \propto \sum_{w\in V} c(w,D) \sum_{i=1}^{n}\sum_{l=1}^{k} p(z_{D,w}=(i,j,l)) \]

\[ p^{(t+1)}(\theta_l\mid\kappa_j) \propto \sum_{(D,C)}\sum_{w\in V} c(w,D) \sum_{i=1}^{n} p(z_{D,w}=(i,j,l)) \]

\[ p^{(t+1)}(w\mid\theta_l,v_i) \propto \sum_{(D,C)} c(w,D) \sum_{j=1}^{m} p(z_{D,w}=(i,j,l)) \]
Regularization of the Model
• Why?
  – Generality brings high complexity (inefficiency, multiple local maxima)
  – Real applications have domain constraints/knowledge
• Two useful simplifications:
  – Fixed-Coverage: only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)
  – Fixed-View: only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)
• In general:
  – Impose priors on model parameters
  – Support the whole spectrum from unsupervised to supervised learning
Interpretation of Topics
[Figure: topic labeling pipeline. A statistical (multinomial) topic model produces word distributions such as: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, … A candidate label pool (e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …) is extracted from the collection (context) with an NLP chunker and n-gram statistics; candidates are scored for relevance to the topic and re-ranked for coverage and discrimination, yielding a ranked list of labels (e.g., clustering algorithm; distance measure; …).]
Relevance: the Zero-Order Score
• Intuition: prefer phrases covering high-probability words of the topic

[Figure: a latent topic p(w|θ) with top words clustering, dimensional, algorithm, birch, shape, …, body, …; "clustering algorithm" is a good label (l_1), "body shape" a bad one (l_2).]

\[ \mathrm{Score}(l,\theta) = \log\frac{p(l\mid\theta)}{p(l)} \]
Relevance: the First-Order Score
• Intuition: prefer phrases whose context distribution is similar to the topic's

[Figure: the topic distribution P(w|θ) (clustering, dimension, partition, algorithm, hash, …) is compared with label context distributions estimated from C (SIGMOD Proceedings). A good label l_1 = "clustering algorithm" (P(w|l_1): clustering, hash, dimension, algorithm, partition, …) has D(θ || l_1) < D(θ || l_2) for a bad label l_2 = "hash join" (P(w|l_2): clustering, hash, dimension, join, algorithm, …).]

\[ \mathrm{Score}(l,\theta) = \sum_{w} p(w\mid\theta)\, \mathrm{PMI}(w, l \mid C) \]
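Given an estimate of PMI(w, l | C) from co-occurrence counts in the reference collection, the first-order score is a single dot product; a hedged sketch (names and the window-based PMI estimate are illustrative assumptions):

```python
import numpy as np

def pmi(count_wl, count_w, count_l, n_windows, eps=1e-12):
    """PMI(w, l | C) estimated from (co-)occurrence counts over text windows in C."""
    return np.log((count_wl / n_windows + eps) /
                  ((count_w / n_windows) * (count_l / n_windows) + eps))

def first_order_score(p_w_theta, pmi_wl):
    """Score(l, theta) = sum_w p(w | theta) * PMI(w, l | C).

    p_w_theta : (V,) topic word distribution p(w | theta)
    pmi_wl    : (V,) PMI(w, l | C) for the candidate label l
    """
    return float(p_w_theta @ pmi_wl)
```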
Sample Results
• Comparative text mining
• Spatiotemporal pattern mining
• Sentiment summary
• Event impact analysis
• Temporal author-topic analysis
Comparing News Articles Iraq War (30 articles) vs. Afghan War (26 articles)
Cluster 1 — Common Theme: united 0.042, nations 0.04, …; Iraq Theme: n 0.03, weapons 0.024, inspections 0.023, …; Afghan Theme: northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …

Cluster 2 — Common Theme: killed 0.035, month 0.032, deaths 0.023, …; Iraq Theme: troops 0.016, hoon 0.015, sanches 0.012, …; Afghan Theme: taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …

Cluster 3 — …
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars
Comparing Laptop Reviews
Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]).

These word distributions can be used to segment text and add hyperlinks between documents.
Spatiotemporal Patterns in Blog Articles
• Query= “Hurricane Katrina”
• Topics in the results:
• Spatiotemporal patterns
(The six topics are the same as in the earlier Hurricane Katrina table: Government Response, New Orleans, Oil Price, Praying and Blessing, Aid and Donation, Personal.)
Theme Life Cycles for Hurricane Katrina
[Figure: theme life cycles over time, with word distributions — New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …; Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …]
Theme Snapshots for Hurricane Katrina
Week 1: The theme is the strongest along the Gulf of Mexico
Week 2: The discussion moves towards the north and west
Week 3: The theme distributes more uniformly over the states
Week 4: The theme is again strong along the east coast and the Gulf of Mexico
Week 5: The theme fades out in most states
Theme Life Cycles: KDD
[Figure: global theme life cycles of KDD abstracts — normalized strength of theme over time (1999-2004) for themes including Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business. Sample theme word distributions — Biology Data: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …; Business: marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …; Association Rule: rules 0.0142, association 0.0064, support 0.0053, …]
Theme Evolution Graph: KDD

[Figure: themes evolving year by year (1999-2004), e.g. — 1999: SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …; decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …; 2000: classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …; later years: web 0.009, classification 0.007, features 0.006, topic 0.005, …; mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …; topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …; information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …]
Blog Sentiment Summary (query=“Da Vinci Code”)
Facet 1: Movie
– Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."
– Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"
– Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Facet 2: Book
– Neutral: "I remembered when i first read the book, I finished the book in two days." / "I'm reading "Da Vinci Code" now."
– Positive: "Awesome book." / "So still a good book to past time."
– Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."
Results: Sentiment Dynamics
Facet: the book "The Da Vinci Code" (bursts during the movie, Pos > Neg)

Facet: religious beliefs (bursts during the movie, Neg > Pos)
Event Impact Analysis: IR Research
[Figure: evolution of the theme "retrieval models" in SIGIR papers. Overall theme: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, … Before the start of the TREC conferences (1992): vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …; probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, … After TREC: xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, … After the publication of the paper "A language modeling approach to information retrieval" (1998): model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …]
Temporal-Author-Topic Analysis
[Figure: temporal author-topic analysis around the global theme "frequent patterns" (from 2000 on) for two authors, Jiawei Han (Author A) and Rakesh Agrawal (Author B), with topic snapshots over time such as: pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138, …; close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102, …; index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109, …; project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170, …; research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154, …]
Modeling Topical Communities (Mei et al. 08)
Community 1: Information Retrieval
Community 2: Data Mining
Community 3: Machine Learning
Other Extensions (LDA Extensions)
• Many extensions of LDA, mostly by David Blei, Andrew McCallum, and their co-authors
• Some examples:
  – Hierarchical topic models [Blei et al. 03]
  – Modeling annotated data [Blei & Jordan 03]
  – Dynamic topic models [Blei & Lafferty 06]
  – Pachinko allocation [Li & McCallum 06]
• Also, some context-specific extensions of PLSA, e.g., the author-topic model [Steyvers et al. 04]
Future Research Directions
• Topic models for text mining:
  – Evaluation of topic models
  – Improving the efficiency of estimation and inference
  – Incorporating linguistic knowledge
  – Applications in new domains and for new tasks
• Text mining in general:
  – Combination of NLP-style and DM-style mining algorithms
  – Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)
  – Interactive mining:
    • Incorporate user constraints and support iterative mining
    • Design and implement mining languages
Lecture 5: Key Points
• Topic models coupled with topic labeling are quite useful for extracting and modeling subtopics in text
• Adding context variables significantly increases a topic model's capacity for text mining:
  – It enables interpretation of topics in context
  – It accommodates variation analysis and correlation analysis of topics over context
• A user's preferences and domain knowledge can be added as priors or soft constraints
Readings
• PLSA:
– http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
• LDA:
– http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
– Many recent extensions, mostly by David Blei and Andrew McCallum
• CPLSA:
– http://sifaka.cs.uiuc.edu/czhai/pub/kdd06-mix.pdf
– http://sifaka.cs.uiuc.edu/czhai/pub/www08-net.pdf
Discussion
• Topic models for mining multimedia data
– Simultaneous modeling of text and images
• Cross-media analysis
– Text provides context to analyze images and vice versa
Course Summary
[Figure: scope of the course — Information Retrieval over text data (retrieval models/framework, evaluation, feedback, contextual topic models) alongside Computer Vision over multimedia data, both drawing on Statistics, Machine Learning, and Natural Language Processing. Looking forward to collaborations on: (1) evaluation, (2) user modeling, (3) ranking, (4) learning with little supervision; and on integrated multimedia data analysis — mutual reinforcement (e.g., text ↔ images) and simultaneous mining of text + images + video …]
Thank You!