Data Mining Lectures Lecture 12: Text Mining Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 14: Document Clustering and Topic Extraction Note: many of the slides on topic models were adapted from the presentation by Griffiths and Steyvers at the Beckman National Academy of Sciences Symposium on “Mapping Knowledge Domains”, Beckman Center, UC Irvine, May 2003. Padhraic Smyth Department of Information and Computer Science University of California, Irvine

Page 1

ICS 278: Data Mining

Lecture 14: Document Clustering and Topic Extraction

Note: many of the slides on topic models were adapted from the presentation by Griffiths and Steyvers at the Beckman National Academy of Sciences Symposium on “Mapping Knowledge Domains”, Beckman Center, UC Irvine, May 2003.

Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Page 2

Text Mining

• Information Retrieval

• Text Classification

• Text Clustering

• Information Extraction

Page 3

Document Clustering

• Set of documents D in term-vector form
  – no class labels this time
  – want to group the documents into K groups or into a taxonomy
  – each cluster hypothetically corresponds to a “topic”

• Methods:
  – Any of the well-known clustering methods
  – K-means
    • e.g., “spherical k-means”: normalize document vectors to unit length, so distances depend only on direction (cosine similarity)
  – Hierarchical clustering
  – Probabilistic model-based clustering methods
    • e.g., mixtures of multinomials

• Single-topic versus multiple-topic models
  – Extensions to author-topic models
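
The “spherical k-means” idea mentioned above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function name `spherical_kmeans`, the toy matrix `docs`, and the initialization from randomly chosen documents are all assumptions made for the example.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X by cosine similarity (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Normalize every document vector to unit length.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)
    # Assumed initialization: use k distinct documents as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # For unit vectors, cosine similarity is just a dot product.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                total = members.sum(axis=0)
                # Re-project the centroid onto the unit sphere.
                centroids[j] = total / max(np.linalg.norm(total), 1e-12)
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

# Toy term-count matrix: rows are documents, columns are terms.
docs = np.array([
    [3, 2, 0, 0],
    [6, 5, 0, 0],
    [0, 0, 2, 4],
    [0, 0, 1, 3],
], dtype=float)
labels, centroids = spherical_kmeans(docs, k=2)
print(labels)
```

Documents 0 and 1 differ in length but point in nearly the same direction, which is why normalizing before clustering matters for term vectors.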

Page 4

Mixture Model Clustering

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid c_k)\, \alpha_k$$

where $\alpha_k$ is the weight of component $c_k$.

Page 5

Mixture Model Clustering

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid c_k)\, \alpha_k$$

$$p(\mathbf{x} \mid c_k) = \prod_{j=1}^{d} p(x_j \mid c_k)$$

Page 6

Mixture Model Clustering

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid c_k)\, \alpha_k$$

$$p(\mathbf{x} \mid c_k) = \prod_{j=1}^{d} p(x_j \mid c_k)$$

Conditional independence model for each component (often quite useful to first order)

Page 7

Mixtures of Documents

[Figure: binary document-by-term matrix (rows: Documents, columns: Terms); one block of documents is generated by Component 1 and another by Component 2]

Page 8

[Figure: binary document-by-term matrix (rows: Documents, columns: Terms), shown without component labels]

Page 9

[Figure: binary document-by-term matrix (rows: Documents, columns: Terms) with a component label, C1 or C2, attached to each document; the labels are annotated “Treat as Missing”]

Page 10

[Figure: binary document-by-term matrix (rows: Documents, columns: Terms); each document's missing label is replaced by membership probabilities P(C1|x_i) and P(C2|x_i)]

E-Step: estimate component membership probabilities given current parameter estimates

Page 11

[Figure: binary document-by-term matrix (rows: Documents, columns: Terms) with membership probabilities P(C1|x_i) and P(C2|x_i) attached to each document]

M-Step: use “fractional” weighted data to get new estimates of the parameters
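
The E-step and M-step described on these slides can be sketched for a binary document-by-term matrix as below. This is an illustrative sketch under stated assumptions, not the lecture's code: it assumes a mixture of independent Bernoulli terms (the conditional independence model), the names `em_bernoulli_mixture` and `init_rows` are hypothetical, the parameters are initialized from chosen documents, and add-one smoothing is used in the M-step to keep probabilities away from 0 and 1.

```python
import numpy as np

def em_bernoulli_mixture(X, init_rows, n_iter=25):
    """EM for a mixture of independent Bernoulli terms (hypothetical helper).

    X: (N, d) binary matrix; init_rows: one document index per component.
    """
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    K = len(init_rows)
    alpha = np.full(K, 1.0 / K)               # mixture weights
    theta = 0.25 + 0.5 * X[list(init_rows)]   # theta[k, j] = p(x_j = 1 | c_k)
    for _ in range(n_iter):
        # E-step: membership probabilities p(c_k | x_i) under current parameters.
        log_p = (X @ np.log(theta).T
                 + (1.0 - X) @ np.log(1.0 - theta).T
                 + np.log(alpha))
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from fractionally weighted documents.
        Nk = resp.sum(axis=0)
        alpha = (Nk + 1.0) / (N + K)
        theta = (resp.T @ X + 1.0) / (Nk[:, None] + 2.0)
    return alpha, theta, resp

X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
])
alpha, theta, resp = em_bernoulli_mixture(X, init_rows=(0, 3))
print(resp.argmax(axis=1))
```

Each row of `resp` plays the role of the P(C1|x), P(C2|x) pairs in the figure; the M-step treats those values as fractional counts.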

Page 12

A Document Cluster

Most Likely Terms in Component 5 (weight = 0.08)

TERM      p(t|k)
write     0.571
drive     0.465
problem   0.369
mail      0.364
articl    0.332
hard      0.323
work      0.319
system    0.303
good      0.296
time      0.273

Highest Lift Terms in Component 5 (weight = 0.08)

TERM      LIFT   p(t|k)   p(t)
scsi      7.7    0.13     0.02
drive     5.7    0.47     0.08
hard      4.9    0.32     0.07
card      4.2    0.23     0.06
format    4.0    0.12     0.03
softwar   3.8    0.21     0.05
memori    3.6    0.14     0.04
install   3.6    0.14     0.04
disk      3.5    0.12     0.03
engin     3.3    0.21     0.06

Page 13

Another Document Cluster

Most Likely Terms in Component 1 (weight = 0.11)

TERM        p(t|k)
articl      0.684
good        0.368
dai         0.363
fact        0.322
god         0.320
claim       0.294
apr         0.279
fbi         0.256
christian   0.256
group       0.239

Highest Lift Terms in Component 1 (weight = 0.11)

TERM        LIFT   p(t|k)   p(t)
fbi         8.3    0.26     0.03
jesu        5.5    0.16     0.03
fire        5.2    0.20     0.04
christian   4.9    0.26     0.05
evid        4.8    0.24     0.05
god         4.6    0.32     0.07
gun         4.2    0.17     0.04
faith       4.2    0.12     0.03
kill        3.8    0.22     0.06
bibl        3.7    0.11     0.03
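
The LIFT column appears to be the ratio p(t|k) / p(t), i.e. how much more likely a term is inside the component than in the corpus overall. That definition is an assumption read off the column headings, and the helper name `lift` is hypothetical. A small sketch using the two-decimal values shown for Component 1:

```python
def lift(p_t_given_k, p_t):
    """Ratio of a term's in-component probability to its overall probability."""
    return p_t_given_k / p_t

# (p(t|k), p(t)) pairs as displayed on the slide, rounded to two decimals.
rows = {"fbi": (0.26, 0.03), "jesu": (0.16, 0.03), "god": (0.32, 0.07)}
for term, (p_tk, p_t) in rows.items():
    print(term, round(lift(p_tk, p_t), 2))
```

Because the displayed probabilities are rounded, ratios recomputed this way need not reproduce the slide's LIFT values exactly.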

Page 14

A topic is represented as a (multinomial) distribution over words

Example topic #1           Example topic #2
SPEECH           .0691     WORDS         .0671
RECOGNITION      .0412     WORD          .0557
SPEAKER          .0288     USER          .0230
PHONEME          .0224     DOCUMENTS     .0205
CLASSIFICATION   .0154     TEXT          .0195
SPEAKERS         .0140     RETRIEVAL     .0152
FRAME            .0135     INFORMATION   .0144
PHONETIC         .0119     DOCUMENT      .0144
PERFORMANCE      .0111     LARGE         .0102
ACOUSTIC         .0099     COLLECTION    .0098
BASED            .0098     KNOWLEDGE     .0087
PHONEMES         .0091     MACHINE       .0080
UTTERANCES       .0091     RELEVANT      .0077
SET              .0089     SEMANTIC      .0076
LETTER           .0088     SIMILARITY    .0071
…                          …

Page 15

The basic model….

[Figure: graphical model with a single node C and word nodes X1, X2, ..., Xd]

Page 16

A better model….

[Figure: graphical model with nodes A, B, C and word nodes X1, X2, ..., Xd]

Page 17

A better model….

[Figure: graphical model with nodes A, B, C and word nodes X1, X2, ..., Xd]

Inference can be intractable due to undirected loops!

Page 18

A better model for documents….

• Multi-topic model
  – A document is generated from multiple components
  – Multiple components can be active at once
  – Each component = multinomial distribution
  – Parameter estimation is tricky
  – Very useful:
    • “parses” into high-level semantic components

Page 19

History of multi-topic models

• Latent class models in statistics
• Hofmann (1999)
  – Original application to documents
• Blei, Ng, and Jordan (2001, 2003)
  – Variational methods
• Griffiths and Steyvers (2003)
  – Gibbs sampling approach (very efficient)

Page 20

1                      2                       3                        4
GROUP      0.057185    DYNAMIC      0.152141   DISTRIBUTED    0.192926  RESEARCH    0.066798
MULTICAST  0.051620    STRUCTURE    0.137964   COMPUTING      0.044376  SUPPORTED   0.043233
INTERNET   0.049499    STRUCTURES   0.088040   SYSTEMS        0.038601  PART        0.035590
PROTOCOL   0.041615    STATIC       0.043452   SYSTEM         0.031797  GRANT       0.034476
RELIABLE   0.020877    PAPER        0.032706   HETEROGENEOUS  0.030996  SCIENCE     0.023250
GROUPS     0.019552    DYNAMICALLY  0.023940   ENVIRONMENT    0.023163  FOUNDATION  0.022653
PROTOCOLS  0.019088    PRESENT      0.015328   PAPER          0.017960  FL          0.021220
IP         0.014980    META         0.015175   SUPPORT        0.016587  WORK        0.021061
TRANSPORT  0.012529    CALLED       0.011669   ARCHITECTURE   0.016416  NATIONAL    0.019947
DRAFT      0.009945    RECURSIVE    0.010145   ENVIRONMENTS   0.013271  NSF         0.018116

“Content” components

“Boilerplate” components

Page 21

5                       6                         7                     8
DIMENSIONAL  0.038901   CLASSIFICATION? see below

5                       6                         7                     8
DIMENSIONAL  0.038901   RULES           0.090569  ORDER      0.192759   GRAPH      0.095687
POINTS       0.037263   CLASSIFICATION  0.062699  TERMS      0.048688   PATH       0.061784
SURFACE      0.031438   RULE            0.062174  PARTIAL    0.044907   GRAPHS     0.061217
GEOMETRIC    0.025006   ACCURACY        0.028926  HIGHER     0.041284   PATHS      0.030151
SURFACES     0.020152   ATTRIBUTES      0.023090  REDUCTION  0.035061   EDGE       0.028590
MESH         0.016875   INDUCTION       0.021909  PAPER      0.028602   NUMBER     0.022775
PLANE        0.013902   CLASSIFIER      0.019418  TERM       0.018204   CONNECTED  0.016817
POINT        0.013780   SET             0.018303  ORDERING   0.017652   DIRECTED   0.014405
GEOMETRY     0.013780   ATTRIBUTE       0.016204  SHOW       0.017022   NODES      0.013625
PLANAR       0.012385   CLASSIFIERS     0.015417  MAGNITUDE  0.015526   VERTICES   0.013554

9                         10                    11                     12
INFORMATION    0.281237   SYSTEM      0.143873  PAPER       0.077870   LANGUAGE     0.158786
TEXT           0.048675   FILE        0.054076  CONDITIONS  0.041187   PROGRAMMING  0.097186
RETRIEVAL      0.044046   OPERATING   0.053963  CONCEPT     0.036268   LANGUAGES    0.082410
SOURCES        0.029548   STORAGE     0.039072  CONCEPTS    0.033457   FUNCTIONAL   0.032815
DOCUMENT       0.029000   DISK        0.029957  DISCUSSED   0.027414   SEMANTICS    0.027003
DOCUMENTS      0.026503   SYSTEMS     0.029221  DEFINITION  0.024673   SEMANTIC     0.024341
RELEVANT       0.018523   KERNEL      0.028655  ISSUES      0.024603   NATURAL      0.016410
CONTENT        0.016574   ACCESS      0.018293  PROPERTIES  0.021511   CONSTRUCTS   0.014129
AUTOMATICALLY  0.009326   MANAGEMENT  0.017218  IMPORTANT   0.021370   GRAMMAR      0.013640
DIGITAL        0.008777   UNIX        0.016878  EXAMPLES    0.019754   LISP         0.010326

Page 22

13                      14                     15                        16
MODEL         0.429185  PAPER        0.050411  TYPE            0.088650  KNOWLEDGE    0.212603
MODELS        0.201810  APPROACHES   0.045245  SPECIFICATION   0.051469  SYSTEM       0.090852
MODELING      0.066311  PROPOSED     0.043132  TYPES           0.046571  SYSTEMS      0.051978
QUALITATIVE   0.018417  CHANGE       0.040393  FORMAL          0.036892  BASE         0.042277
COMPLEX       0.009272  BELIEF       0.025835  VERIFICATION    0.029987  EXPERT       0.020172
QUANTITATIVE  0.005662  ALTERNATIVE  0.022470  SPECIFICATIONS  0.024439  ACQUISITION  0.017816
CAPTURE       0.005301  APPROACH     0.020905  CHECKING        0.024439  DOMAIN       0.016638
MODELED       0.005301  ORIGINAL     0.019026  SYSTEM          0.023259  INTELLIGENT  0.015737
ACCURATELY    0.004639  SHOW         0.017852  PROPERTIES      0.018242  BASES        0.015390
REALISTIC     0.004278  PROPOSE      0.016991  ABSTRACT        0.016826  BASED        0.014004

“Style” components

Page 23

A generative model for documents

• Each document is a mixture of topics
• Each word is chosen from a single topic
• the topic $z_i$ of each word is drawn from parameters $\theta^{(d)}$ (the document's topic proportions)
• the word $w_i$ is drawn from parameters $\phi^{(z_i)}$ (the chosen topic's word distribution)

(Blei, Ng, & Jordan, 2003)

Page 24

A generative model for documents

• Called Latent Dirichlet Allocation (LDA)
• Introduced by Blei, Ng, and Jordan (2003), reinterpretation of PLSI (Hofmann, 2001)

[Figure: graphical model in which each word w has its own latent topic assignment z]

Page 25

[Figure: two matrix factorizations of the words × documents matrix.
 SVD (Dumais, Landauer): words × documents = U (words × dims) · D (dims × dims) · V (dims × documents), giving word vectors and document vectors.
 LDA: P(w) (words × documents) = P(w|z) (words × topics) · P(z) (topics × documents).]

Page 26

A generative model for documents

w             P(w|z = 1)   P(w|z = 2)
HEART         0.2          0.0
LOVE          0.2          0.0
SOUL          0.2          0.0
TEARS         0.2          0.0
JOY           0.2          0.0
SCIENTIFIC    0.0          0.2
KNOWLEDGE     0.0          0.2
WORK          0.0          0.2
RESEARCH      0.0          0.2
MATHEMATICS   0.0          0.2
              topic 1      topic 2

$P(w \mid z = 1) = \phi^{(1)}$, $P(w \mid z = 2) = \phi^{(2)}$

Page 27

Choose mixture weights for each document, generate “bag of words”

$\theta = \{P(z = 1), P(z = 2)\}$

{0, 1}         MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
{0.25, 0.75}   SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
{0.5, 0.5}     MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}   WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
{1, 0}         TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
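
The generative process on this slide (choose mixture weights for a document, then draw each word from a single topic) can be sketched as below. The two topic distributions are taken from the previous slide; the function name `generate_document`, the document length, and the random seed are assumptions for illustration.

```python
import numpy as np

vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]

# phi[j] = P(w | z = j + 1): the two toy topics from the slides.
phi = np.array([
    [0.2] * 5 + [0.0] * 5,   # topic 1
    [0.0] * 5 + [0.2] * 5,   # topic 2
])

def generate_document(theta, n_words, rng):
    """Draw a topic for every word position, then a word from that topic."""
    z = rng.choice(len(theta), size=n_words, p=theta)
    words = [vocab[rng.choice(len(vocab), p=phi[topic])] for topic in z]
    return z, words

rng = np.random.default_rng(0)
for theta in ([0.0, 1.0], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1.0, 0.0]):
    z, words = generate_document(np.array(theta), n_words=10, rng=rng)
    print(theta, " ".join(words))
```

Word order carries no information here, which is the “bag of words” assumption named in the slide title.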

Page 28

Bayesian inference

$$P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}, \mathbf{z})}$$

• Sum in the denominator is over $T^n$ terms

• Full posterior is tractable only up to a normalizing constant

Page 29

Bayesian sampling

• Sample from a Markov chain which converges to the target distribution of interest
  – Known as Markov chain Monte Carlo in general

• Simple version is known as Gibbs sampling
  – Say we are interested in estimating p(x, y | D)
  – We can approximate this by sampling from p(x|y,D), p(y|x,D) in an iterative fashion
  – Useful when conditionals are known, but joint distribution is not easy to work with
  – Converges to true distribution under fairly broad assumptions

• Can compute approximate statistics from intractable distributions
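
The alternating scheme described above (sample x given y, then y given x) can be illustrated on a case where the answer is known. The target chosen here, a standard bivariate normal with correlation `rho`, is an assumption made purely for the example and does not appear in the lecture; its conditionals are x | y ~ N(rho·y, 1 − rho²) and symmetrically for y.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)      # conditional standard deviation
    x, y = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for t in range(burn_in + n_samples):
        x = rng.normal(rho * y, sd)   # draw from p(x | y)
        y = rng.normal(rho * x, sd)   # draw from p(y | x)
        if t >= burn_in:              # discard the burn-in draws
            samples[t - burn_in] = (x, y)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(np.corrcoef(samples.T)[0, 1])   # approximate statistic from the chain
```

Only the two conditionals are ever sampled, yet the empirical correlation of the draws approaches that of the joint distribution.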

Page 30

Gibbs sampling

• Need full conditional distributions for variables
• Since we only sample z we need

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$

where $n^{(w_i)}_{-i,j}$ is the number of times word w is assigned to topic j and $n^{(d_i)}_{-i,j}$ is the number of times topic j is used in document d, both counted without the current token i.
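
A compact sketch of a sampler built on these two count tables is given below. It assumes the standard collapsed Gibbs update for LDA with symmetric Dirichlet hyperparameters, P(z_i = j | rest) ∝ (n_wj + β)/(n_j + Wβ) · (n_dj + α); the hyperparameter values, the function name `lda_gibbs`, and the toy corpus are assumptions, and nothing here is taken from the authors' code.

```python
import numpy as np

def lda_gibbs(words, docs, T, W, D, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA on token lists (hypothetical helper).

    words[i], docs[i]: word id and document id of token i.
    """
    rng = np.random.default_rng(seed)
    n = len(words)
    z = rng.integers(T, size=n)        # random initial topic assignments
    n_wt = np.zeros((W, T))            # times word w is assigned to topic j
    n_dt = np.zeros((D, T))            # times topic j is used in document d
    n_t = np.zeros(T)                  # total tokens assigned to topic j
    for i in range(n):
        n_wt[words[i], z[i]] += 1
        n_dt[docs[i], z[i]] += 1
        n_t[z[i]] += 1
    for _ in range(n_iter):
        for i in range(n):
            w, d, t = words[i], docs[i], z[i]
            # Remove token i from the counts before resampling it.
            n_wt[w, t] -= 1
            n_dt[d, t] -= 1
            n_t[t] -= 1
            # Full conditional; the per-document denominator is constant in j.
            p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
            t = rng.choice(T, p=p / p.sum())
            z[i] = t
            n_wt[w, t] += 1
            n_dt[d, t] += 1
            n_t[t] += 1
    return z, n_wt, n_dt

# Toy corpus: document 0 uses words {0, 1}, document 1 uses words {2, 3}.
words = [0, 1, 0, 1, 2, 3, 2, 3]
docs = [0, 0, 0, 0, 1, 1, 1, 1]
z, n_wt, n_dt = lda_gibbs(words, docs, T=2, W=4, D=2)
print(z)
```

Storing one W × T table and one D × T table is what makes the memory cost grow with T(W + D), and each sweep touches every token once for every topic.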

Page 31

Gibbs sampling

                               iteration 1
i     w_i           d_i        z_i
1     MATHEMATICS   1          2
2     KNOWLEDGE     1          2
3     RESEARCH      1          1
4     WORK          1          2
5     MATHEMATICS   1          1
6     RESEARCH      1          2
7     WORK          1          2
8     SCIENTIFIC    1          1
9     MATHEMATICS   1          2
10    WORK          1          1
11    SCIENTIFIC    2          1
12    KNOWLEDGE     2          1
...   ...           ...        ...
50    JOY           5          2

Page 32

Gibbs sampling

[Table as on Page 31 (columns i, w_i, d_i, z_i for iteration 1), with a new z_i column added for iteration 2; the first entry of the new column is marked “?” because z_1 is about to be resampled]

Pages 35-38

Gibbs sampling

[Animation frames: iteration 2 proceeds token by token through the same table. The iteration-2 column reads “2 ?”, then “2 1 ?”, then “2 1 1 ?”, then “2 1 1 2 ?”, where “?” marks the z_i currently being resampled]

Page 39

Gibbs sampling

                               iteration 1   iteration 2   …   iteration 1000
i     w_i           d_i        z_i           z_i               z_i
1     MATHEMATICS   1          2             2                 2
2     KNOWLEDGE     1          2             1                 2
3     RESEARCH      1          1             1                 2
4     WORK          1          2             2                 1
5     MATHEMATICS   1          1             2                 2
6     RESEARCH      1          2             2                 2
7     WORK          1          2             2                 2
8     SCIENTIFIC    1          1             1                 1
9     MATHEMATICS   1          2             2                 2
10    WORK          1          1             2                 2
11    SCIENTIFIC    2          1             1                 2
12    KNOWLEDGE     2          1             2                 2
...   ...           ...        ...           ...               ...
50    JOY           5          2             1                 1

Page 40

A visual example: Bars

• pixel = word
• image = document
• sample each pixel from a mixture of topics

Page 43

Interpretable decomposition

• SVD gives a basis for the data, but not an interpretable one

• The true basis is not orthogonal, so rotation does no good

Page 44

Bayesian model selection

• How many topics T do we need?

• A Bayesian would consider the posterior:

$$P(T \mid \mathbf{w}) \propto P(\mathbf{w} \mid T)\, P(T)$$

• P(w|T) involves summing over all possible assignments z
  – but it can be approximated by sampling
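
One way such a sampling approximation is sometimes done is a harmonic mean of the likelihoods P(w | z, T) evaluated at posterior samples of z. Whether that is the estimator intended on this slide is an assumption, and the helper name `log_harmonic_mean` is hypothetical. A sketch that works in log space to avoid underflow:

```python
import numpy as np

def log_harmonic_mean(log_likes):
    """log of the harmonic mean of likelihoods, given their logs.

    log HM = log(S) - log(sum_s exp(-log_likes[s])) for S samples.
    """
    ll = np.asarray(log_likes, dtype=float)
    neg = -ll
    m = neg.max()                                   # shift for stability
    log_sum_inv = m + np.log(np.exp(neg - m).sum())
    return np.log(len(ll)) - log_sum_inv

# Likelihoods 1 and 3 have harmonic mean 2 / (1 + 1/3) = 1.5.
print(np.exp(log_harmonic_mean(np.log([1.0, 3.0]))))
```

The estimate is dominated by the smallest sampled likelihoods, so in practice it tends to be noisy and is best compared across values of T using several chains.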

Pages 45-47

Bayesian model selection

[Figure: P(w|T) plotted against possible corpora (horizontal axis: Corpus (w); vertical axis: P(w|T)), with curves labelled T = 10 and T = 100]

Page 48

Back to the bars data set

Page 49

PNAS corpus preprocessing

• Used all D = 28,154 abstracts from 1991-2001
• Used any word occurring in at least five abstracts, not on “stop” list (W = 20,551)
• Segmentation by any delimiting character, total of n = 3,026,970 word tokens in corpus
• Also, PNAS class designations for 2001

Page 50

Running the algorithm

• Memory requirements linear in T(W+D), runtime proportional to nT
• T = 50, 100, 200, 300, 400, 500, 600, (1000)
• Ran 8 chains for each T, burn-in of 1000 iterations, 10 samples/chain at a lag of 100
• All runs completed in under 30 hours on BlueHorizon supercomputer at San Diego
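
To give a feel for the scaling statements above, the two quantities can be evaluated with the corpus sizes from the preprocessing slide. The function name `gibbs_cost` is hypothetical, and the results are counts of table entries and of per-sweep token-topic evaluations, not measured memory or time.

```python
W, D, n = 20_551, 28_154, 3_026_970   # vocabulary size, abstracts, word tokens

def gibbs_cost(T):
    """Return (count-table entries, token-topic evaluations per sweep)."""
    table_entries = T * (W + D)   # one W x T table plus one D x T table
    work_per_sweep = n * T        # every token scores every topic once
    return table_entries, work_per_sweep

for T in (50, 100, 200, 300, 400, 500, 600, 1000):
    entries, work = gibbs_cost(T)
    print(f"T={T:5d}  table entries={entries:,}  evaluations per sweep={work:,}")
```

Both quantities grow linearly in T, so doubling the number of topics roughly doubles both storage and the cost of each iteration.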

Page 51

A selection of topics

(one topic per line, words in the order shown on the slide)

FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL

HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL

MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE

STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE

NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC

TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

Page 52

A selection of topics

(one topic per line, words in the order shown on the slide)

PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS

ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS

CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES

MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY

STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST

MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN

MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS

Page 54

How many topics?

Page 56

Scientific syntax and semantics

[Figure: graphical model with topic assignments z, words w, and syntactic class states x]

Factorization of language based on statistical dependency patterns:

• semantics: probabilistic topics (long-range, document-specific dependencies)
• syntax: probabilistic regular grammar (short-range dependencies, constant across all documents)

Page 57

[Figure: state-transition diagram over classes x, with a word distribution attached to each class.
 x = 1: mixture of topics, z = 1 with weight 0.4 and z = 2 with weight 0.6
   z = 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
   z = 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
 x = 3: THE 0.6, A 0.3, MANY 0.1
 x = 2: OF 0.6, FOR 0.3, BETWEEN 0.1
 Transition probabilities shown on the arrows: 0.9, 0.1, 0.2, 0.8, 0.7, 0.3]

Pages 58-61

[Animation frames: the same state-transition diagram (classes x = 1, x = 2, x = 3; topics z = 1 with weight 0.4 and z = 2 with weight 0.6), generating a sentence one word per frame:
 THE …
 THE LOVE …
 THE LOVE OF …
 THE LOVE OF RESEARCH …]

Page 62

Semantic topics

29            46             51               71           115             125
AGE           SELECTION      LOCI             TUMOR        MALE            MEMORY
LIFE          POPULATION     LOCUS            CANCER       FEMALE          LEARNING
AGING         SPECIES        ALLELES          TUMORS       MALES           BRAIN
OLD           POPULATIONS    ALLELE           BREAST       FEMALES         TASK
YOUNG         GENETIC        GENETIC          HUMAN        SPERM           CORTEX
CRE           EVOLUTION      LINKAGE          CARCINOMA    SEX             SUBJECTS
AGED          SIZE           POLYMORPHISM     PROSTATE     SEXUAL          LEFT
SENESCENCE    NATURAL        CHROMOSOME       MELANOMA     MATING          RIGHT
MORTALITY     VARIATION      MARKERS          CANCERS      REPRODUCTIVE    SONG
AGES          FITNESS        SUSCEPTIBILITY   NORMAL       OFFSPRING       TASKS
CR            MUTATION       ALLELIC          COLON        PHEROMONE       HIPPOCAMPAL
INFANTS       PER            POLYMORPHIC      LUNG         SOCIAL          PERFORMANCE
SPAN          NUCLEOTIDE     POLYMORPHISMS    APC          EGG             SPATIAL
MEN           RATES          RESTRICTION      MAMMARY      BEHAVIOR        PREFRONTAL
WOMEN         RATE           FRAGMENT         CARCINOMAS   EGGS            COGNITIVE
SENESCENT     HYBRID         HAPLOTYPE        MALIGNANT    FERTILIZATION   TRAINING
LOXP          DIVERSITY      GENE             CELL         MATERNAL        TOMOGRAPHY
INDIVIDUALS   SUBSTITUTION   LENGTH           GROWTH       PATERNAL        FRONTAL
CHILDREN      SPECIATION     DISEASE          METASTATIC   FERTILITY       MOTOR
NORMAL        EVOLUTIONARY   MICROSATELLITE   EPITHELIAL   GERM            EMISSION

Page 63

Syntactic classes

5 8 14 25 26 30 33
IN ARE THE SUGGEST LEVELS RESULTS BEEN
FOR WERE THIS INDICATE NUMBER ANALYSIS MAY
ON WAS ITS SUGGESTING LEVEL DATA CAN
BETWEEN IS THEIR SUGGESTS RATE STUDIES COULD
DURING WHEN AN SHOWED TIME STUDY WELL
AMONG REMAIN EACH REVEALED CONCENTRATIONS FINDINGS DID
FROM REMAINS ONE SHOW VARIETY EXPERIMENTS DOES
UNDER REMAINED ANY DEMONSTRATE RANGE OBSERVATIONS DO
WITHIN PREVIOUSLY INCREASED INDICATING CONCENTRATION HYPOTHESIS MIGHT
THROUGHOUT BECOME EXOGENOUS PROVIDE DOSE ANALYSES SHOULD
THROUGH BECAME OUR SUPPORT FAMILY ASSAYS WILL
TOWARD BEING RECOMBINANT INDICATES SET POSSIBILITY WOULD
INTO BUT ENDOGENOUS PROVIDES FREQUENCY MICROSCOPY MUST
AT GIVE TOTAL INDICATED SERIES PAPER CANNOT
INVOLVING MERE PURIFIED DEMONSTRATED AMOUNTS WORK THEY
AFTER APPEARED TILE SHOWS RATES EVIDENCE ALSO
ACROSS APPEAR FULL SO CLASS FINDING
AGAINST ALLOWED CHRONIC REVEAL VALUES MUTAGENESIS BECOME
WHEN NORMALLY ANOTHER DEMONSTRATES AMOUNT OBSERVATION MAG
ALONG EACH EXCESS SUGGESTED SITES MEASUREMENTS LIKELY

Page 64: Padhraic Smyth Department of Information and Computer Science University of California, Irvine

(PNAS, 1991, vol. 88, 4874-4876)

A_23 generalized_49 fundamental_11 theorem_20 of_4 natural_46 selection_46 is_32 derived_17 for_5 populations_46 incorporating_22 both_39 genetic_46 and_37 cultural_46 transmission_46. The_14 phenotype_15 is_32 determined_17 by_42 an_23 arbitrary_49 number_26 of_4 multiallelic_52 loci_40 with_22 two_39-factor_148 epistasis_46 and_37 an_23 arbitrary_49 linkage_11 map_20, as_43 well_33 as_43 by_42 cultural_46 transmission_46 from_22 the_14 parents_46. Generations_46 are_8 discrete_49 but_37 partially_19 overlapping_24, and_37 mating_46 may_33 be_44 nonrandom_17 at_9 either_39 the_14 genotypic_46 or_37 the_14 phenotypic_46 level_46 (or_37 both_39). I_12 show_34 that_47 cultural_46 transmission_46 has_18 several_39 important_49 implications_6 for_5 the_14 evolution_46 of_4 population_46 fitness_46, most_36 notably_4 that_47 there_41 is_32 a_23 time_26 lag_7 in_22 the_14 response_28 to_31 selection_46 such_9 that_47 the_14 future_137 evolution_46 depends_29 on_21 the_14 past_24 selection_46 history_46 of_4 the_14 population_46.

(graylevel = “semanticity”, the probability of using LDA over HMM)

Page 65: Padhraic Smyth Department of Information and Computer Science University of California, Irvine

(PNAS, 1996, vol. 93, 14628-14631)

The_14 ''shape_7'' of_4 a_23 female_115 mating_115 preference_125 is_32 the_14 relationship_7 between_4 a_23 male_115 trait_15 and_37 the_14 probability_7 of_4 acceptance_21 as_43 a_23 mating_115 partner_20, The_14 shape_7 of_4 preferences_115 is_32 important_49 in_5 many_39 models_6 of_4 sexual_115 selection_46, mate_115 recognition_125, communication_9, and_37 speciation_46, yet_50 it_41 has_18 rarely_19 been_33 measured_17 precisely_19, Here_12 I_9 examine_34 preference_7 shape_7 for_5 male_115 calling_115 song_125 in_22 a_23 bushcricket*_13 (katydid*_48). Preferences_115 change_46 dramatically_19 between_22 races_46 of_4 a_23 species_15, from_22 strongly_19 directional_11 to_31 broadly_19 stabilizing_45 (but_50 with_21 a_23 net_49 directional_46 effect_46), Preference_115 shape_46 generally_19 matches_10 the_14 distribution_16 of_4 the_14 male_115 trait_15. This_41 is_32 compatible_29 with_21 a_23 coevolutionary_46 model_20 of_4 signal_9-preference_115 evolution_46, although_50 it_41 does_33 nor_37 rule_20 out_17 an_23 alternative_11 model_20, sensory_125 exploitation_150. Preference_46 shapes_40 are_8 shown_35 to_31 be_44 genetic_11 in_5 origin_7.

Page 67: Padhraic Smyth Department of Information and Computer Science University of California, Irvine

End of presentation on topic models.

Now switch to the author-topic model.

Page 68: Padhraic Smyth Department of Information and Computer Science University of California, Irvine

Recent Results on Author-Topic Models

Page 69: Padhraic Smyth Department of Information and Computer Science University of California, Irvine

[Figure: Authors (A1, A2, ..., Ak) and Words (w1, w2, w3, ..., wN)]

Can we model authors, given documents?

(more generally, build statistical profiles of entities given sparse observed data)

Page 70: Padhraic Smyth Department of Information and Computer Science University of California, Irvine

[Figure: Authors (A1, A2, ..., Ak), Hidden Topics (T1, T2), and Words (w1, w2, w3, ..., wN)]

Model = Author-Topic distributions + Topic-Word distributions

Parameters learned via Bayesian learning
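The two sets of distributions named on this slide can be combined into a small generative sketch. The code below is illustrative only: all probability values are made-up toy numbers, the author, topic and word names simply mirror the labels in the diagram, and choosing an author uniformly at random for each word is an assumption of this sketch rather than something stated on the slide. It shows only the generative direction, not the Bayesian learning of the parameters.

```python
import random

# Hypothetical toy parameters (not from the lecture).
# Author-topic distributions: P(topic | author).
AUTHOR_TOPIC = {
    "A1": {"T1": 0.9, "T2": 0.1},
    "A2": {"T1": 0.2, "T2": 0.8},
}
# Topic-word distributions: P(word | topic).
TOPIC_WORD = {
    "T1": {"w1": 0.5, "w2": 0.5},
    "T2": {"w3": 0.7, "wN": 0.3},
}


def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_document(authors, n_words, rng):
    """Generate (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        author = rng.choice(authors)             # ASSUMPTION: uniform over authors
        topic = draw(AUTHOR_TOPIC[author], rng)  # hidden topic for this word
        word = draw(TOPIC_WORD[topic], rng)      # observed word
        tokens.append((author, topic, word))
    return tokens


def word_given_author(author, word):
    """P(word | author) = sum over topics of P(topic | author) * P(word | topic)."""
    return sum(p_topic * TOPIC_WORD[topic].get(word, 0.0)
               for topic, p_topic in AUTHOR_TOPIC[author].items())


if __name__ == "__main__":
    for token in generate_document(["A1", "A2"], 5, random.Random(0)):
        print(token)
    print(word_given_author("A1", "w1"))
```

The function `word_given_author` makes the role of the hidden topics explicit: an author's word profile is a mixture of topic-word distributions weighted by that author's topic distribution.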
