23/05/16 2
Conference Dinner
- I sit at a table with a probability proportional to the number
of people already sitting there
- If everybody does the same and more and more people
enter, the probabilities for choosing the tables
converge
- The scheme yields a sample of a Dirichlet distribution
Parameters: initial number of participants at each table
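The seating scheme above can be simulated in a few lines. This is a minimal sketch, not code from the talk: the function name `seat_guests`, the three tables and the guest count are assumptions made for illustration.

```python
import random

def seat_guests(initial_counts, n_guests, seed=0):
    """Simulate the conference dinner: each new guest picks a table with
    probability proportional to the number of people already sitting there."""
    rng = random.Random(seed)
    counts = list(initial_counts)
    for _ in range(n_guests):
        # random.choices draws an index with probability proportional to its weight.
        table = rng.choices(range(len(counts)), weights=counts, k=1)[0]
        counts[table] += 1
    total = sum(counts)
    # Final share of guests per table: one sample of the table probabilities.
    return [c / total for c in counts]

if __name__ == "__main__":
    # Assumed setup: three tables with one participant each, 10,000 guests.
    for seed in range(3):
        print(seed, [round(p, 3) for p in seat_guests([1, 1, 1], 10_000, seed)])
```

Each seed gives a different set of final shares, because every run is one sample from the Dirichlet distribution whose parameters are the initial counts.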
Dirichlet Distribution
- “rich get richer”, preferential attachment
- Initial values of less than 1 participant per table produce
sparse distributions
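The effect of small initial values can be checked numerically. The sketch below is an illustration under assumptions: it draws Dirichlet samples by normalising Gamma draws (a standard construction), and the helper names, the five tables and the values 0.1 and 10 are invented for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(initial_per_table, n_tables=5, n_samples=2000, seed=0):
    """Average share of the fullest table over many samples."""
    rng = random.Random(seed)
    alpha = [initial_per_table] * n_tables
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_samples)) / n_samples

if __name__ == "__main__":
    print("0.1 per table:", round(mean_largest_share(0.1), 2))
    print("10 per table: ", round(mean_largest_share(10.0), 2))
```

With values below 1, one table tends to take most of the probability mass (a sparse sample). With large values, the shares stay close to uniform.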
In reality, I choose tables based on the number of people AND the topic they talk about!
Topics
Articles are labelled with tags (e.g. politics, economy, sports, ...)
Politics: election, party, vote, candidate, ...
Economy: dollar, crisis, financial, market, ...
Sports: soccer, basketball, match, score, ...
Topic Modelling
Automatically extract topics from text documents!
Latent Semantic Analysis
Term-document matrix
[Figure: term-document matrix shaded from low occurrence to high occurrence; one slice holds the term frequencies of document 4, and one cell answers how often document 4 contains the word “blood”]
Latent Semantic Analysis (LSA)
- Topic model based on “matrix decomposition”
- Topics are described by “loadings” over the terms
[Figure: term loadings of Topic 1]
The Test Dataset
Test dataset
document 0 to document 6: probabilistic topic model
document 7 to document 19: famous fashion model
Expected topics
Topic 1: famous, fashion, model
Topic 2: model, probabilistic, topic
Test dataset
[Figure: term-document matrix of the test dataset]
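The term-document matrix of the test dataset can be rebuilt with a short script. This is a sketch for illustration: the function name `term_document_matrix` and the choice of rows for terms and columns for documents are assumptions, and the split into documents 0 to 6 and 7 to 19 follows the dataset listing.

```python
from collections import Counter

# Test dataset: documents 0-6 and documents 7-19.
documents = ["probabilistic topic model"] * 7 + ["famous fashion model"] * 13

def term_document_matrix(docs):
    """Return (sorted vocabulary, matrix) where matrix[t][d] counts how often
    term t occurs in document d."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

vocab, matrix = term_document_matrix(documents)

if __name__ == "__main__":
    for term, row in zip(vocab, matrix):
        print(f"{term:>13}", row)
```

The row for "model" is non-zero in every document, which is what makes the word ambiguous between the two expected topics.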
LSA
[Figure: LSA results for Topic 1 and Topic 2 on the test dataset]
LSA – Weaknesses
- Topic loadings can be negative → hard to interpret!
- LSA struggles with ambiguous words
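A minimal LSA sketch on the test dataset follows. The slides only say "matrix decomposition", so it assumes a truncated singular value decomposition, which is the usual choice for LSA. The variable names and the use of NumPy are assumptions for the example.

```python
import numpy as np

terms = ["famous", "fashion", "model", "probabilistic", "topic"]
docs = ["probabilistic topic model"] * 7 + ["famous fashion model"] * 13

# Term-document matrix: rows are terms, columns are documents.
X = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

# SVD: X = U diag(s) Vt; column j of U holds the term loadings of topic j.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # number of topics to keep
loadings = U[:, :k]

if __name__ == "__main__":
    for j in range(k):
        print(f"Topic {j + 1}:",
              {t: round(float(v), 2) for t, v in zip(terms, loadings[:, j])})
```

On this dataset the matrix has rank 2, so two topics reconstruct it exactly. The second topic mixes positive and negative loadings, because it has to be orthogonal to the first one. That illustrates why LSA loadings are hard to read as topics.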
Probabilistic LSA
Probabilistic LSA (PLSA)
- Based on categorical distributions
- Probabilistic model that explains
the creation of documents
The PLSA model for the creation of words in documents:
1) Each document has a categorical distribution
over the topics
2) Each topic has a categorical distribution over
all words
3) Creation of a word in document i:
  1) Draw a topic from the topic distribution of document i
  2) Draw a word from the word distribution of that topic
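The generative steps above can be sketched as follows. All probabilities, the document names `doc_a` and `doc_b`, and the helper functions are hypothetical values picked by hand for illustration. They are not fitted parameters from the talk.

```python
import random

# Hypothetical parameters, chosen by hand for illustration (not fitted values).
topic_given_doc = {            # 1) one categorical distribution over topics per document
    "doc_a": {"Topic 1": 0.9, "Topic 2": 0.1},
    "doc_b": {"Topic 1": 0.2, "Topic 2": 0.8},
}
word_given_topic = {           # 2) one categorical distribution over words per topic
    "Topic 1": {"famous": 0.4, "fashion": 0.4, "model": 0.2},
    "Topic 2": {"probabilistic": 0.4, "topic": 0.4, "model": 0.2},
}

def draw(distribution, rng):
    """Draw one outcome from a categorical distribution given as {outcome: probability}."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_word(doc, rng):
    """3) Create one word of a document: draw a topic, then draw a word from it."""
    topic = draw(topic_given_doc[doc], rng)
    word = draw(word_given_topic[topic], rng)
    return topic, word

if __name__ == "__main__":
    rng = random.Random(0)
    print([generate_word("doc_a", rng) for _ in range(5)])
```

Under this process, the probability of a word w in document d is the sum over all topics z of p(z | d) · p(w | z). Fitting PLSA means estimating the two sets of distributions from observed documents, which this sketch does not attempt.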
Probabilistic LSA (PLSA)
[Figure: PLSA results for Topic 1 and Topic 2 on the test dataset]
[Figure: bar charts of the loading (axis from 0 to 0.8) on Topic 1 and Topic 2 for Document 0 (probabilistic topic model) and Document 7 (famous fashion model)]
PLSA – Strengths & Weaknesses
- Topics are probability distributions and easy to interpret!
- PLSA still struggles with ambiguous words
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA)
- A word in a document is likely to belong to the same topic
as the other words of that document
document 7: famous fashion model
famous: Topic 1, fashion: Topic 1, model: ? → Topic 1
document 0: probabilistic topic model
probabilistic: Topic 2, topic: Topic 2, model: ? → Topic 2
- We would need some preference for already assigned
topics in a document
→ Dirichlet distribution!
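The preference for already assigned topics can be made concrete with the update used in collapsed Gibbs sampling for LDA. The slides do not say which inference algorithm is used, so the choice of Gibbs sampling is an assumption. The function name, the corpus counts and the values of `alpha` and `beta` are also hypothetical.

```python
def topic_probabilities(doc_topic_counts, topic_word_counts, topic_totals,
                        word, alpha, beta, vocab_size):
    """Probability of each topic for one word given all other assignments:
    proportional to (n_doc_topic + alpha) * (n_topic_word + beta)
                    / (n_topic + vocab_size * beta).
    All counts exclude the word that is being assigned."""
    weights = []
    for k, n_dk in enumerate(doc_topic_counts):
        n_kw = topic_word_counts[k].get(word, 0)
        weights.append((n_dk + alpha) * (n_kw + beta)
                       / (topic_totals[k] + vocab_size * beta))
    total = sum(weights)
    return [w / total for w in weights]

# Document 7: "famous" and "fashion" are already assigned to Topic 1 (index 0).
doc_topic_counts = [2, 0]
# Hypothetical corpus counts in which "model" is equally common in both topics.
topic_word_counts = [{"famous": 5, "fashion": 5, "model": 5},
                     {"probabilistic": 5, "topic": 5, "model": 5}]
topic_totals = [15, 15]

probs = topic_probabilities(doc_topic_counts, topic_word_counts, topic_totals,
                            word="model", alpha=0.1, beta=0.1, vocab_size=5)

if __name__ == "__main__":
    print([round(p, 3) for p in probs])
```

The word counts for "model" are identical in both topics here, so the decision comes entirely from the document: Topic 1 gets probability 2.1 / 2.2 (about 0.95). The small `alpha` plays the role of the "less than 1 participant per table" setting and keeps each document concentrated on few topics.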
Dirichlet distribution
Latent Dirichlet Allocation (LDA)
[Figure: LDA results for Topic 1 and Topic 2 on the test dataset]
Probabilistic topic model (with sparse Dirichlet)
[Figure: results for Document 0 (probabilistic topic model) and Document 7 (famous fashion model)]
LDA – Strengths
- LDA can cope with ambiguous words!
- Most popular topic model
(Human) Evaluation
          PLSA                                           LDA
Topic 1   family, registered, like, hard, members, …     first, network, time, won, week, third, ...
Topic 2   high, left, planned, organization, story, …    two, house, found, police, car, home, ..
Topic 3   normal, predicted, first, chief, health, …     cents, futures, cent, lower, higher, ...
…         …                                              ...
Topic Model Game
- Tests the semantic coherence of topics
- Given the top-5 words of a topic and an intruder word
from a different topic – find the intruder word!
Example: air, pollution, power, blood, environmental, nuclear
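A question for the game can be assembled as sketched below. This is an illustration, not the code behind the linked game: the function name is an assumption, and the word lists reuse the tag examples from the introduction, which give only four words per tag, so `top_n` is 4 here instead of 5.

```python
import random

def make_intruder_question(topics, topic_index, rng, top_n=5):
    """Mix the top-n words of one topic with one intruder: a top word of a
    different topic that is not among the chosen topic's top words."""
    top_words = topics[topic_index][:top_n]
    candidates = [w
                  for i, words in enumerate(topics) if i != topic_index
                  for w in words[:top_n] if w not in top_words]
    intruder = rng.choice(candidates)
    question = top_words + [intruder]
    rng.shuffle(question)  # hide the intruder's position
    return question, intruder

# Word lists taken from the tag examples (politics, economy, sports).
topics = [
    ["election", "party", "vote", "candidate"],
    ["dollar", "crisis", "financial", "market"],
    ["soccer", "basketball", "match", "score"],
]

if __name__ == "__main__":
    question, intruder = make_intruder_question(topics, 0, random.Random(0), top_n=4)
    print(question, "intruder:", intruder)
```

If human players find the intruder easily, the topic's top words are semantically coherent. If they guess at random, the topic is probably not.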
Topic Model Game
https://tinyurl.com/tmt16
Summary
- Dirichlet distribution (Polya urn scheme)
- Latent Semantic Analysis (LSA)
- Probabilistic Latent Semantic Analysis (PLSA)
- Latent Dirichlet Allocation (LDA)
- Human evaluation of topic models