Preserving Semantic Content in Text Mining Using Multigrams
Yasmin H. Said
Department of Computational and Data Sciences, George Mason University
QMDNS 2010 - May 26, 2010
This is joint work with Edward J. Wegman
Outline
• Background on Text Mining
• Bigrams
– Term-Document and Bigram-Document Matrices
– Term-Term and Document-Document Associations
• Example Using 15,863 Documents
To read between the lines is easier than to follow the text. -Henry James
Text Data Mining
• Synthesis of …
– Information Retrieval
• Focuses on retrieving documents from a fixed database
• May be multimedia including text, images, video, audio
– Natural Language Processing
• Usually addresses more challenging questions
• Bag-of-words methods
• Vector space models
– Statistical Data Mining
• Pattern recognition, classification, clustering
Natural Language Processing
• Key elements are:
– Morphology (grammar of word forms)
– Syntax (grammar of word combinations to form sentences)
– Semantics (meaning of a word or sentence)
– Lexicon (vocabulary or set of words)
• Time flies like an arrow
– Time passes speedily like an arrow passes speedily, or
– Measure the speed of a fly like you would measure the speed of an arrow
• Ambiguity of nouns and verbs
• Ambiguity of meaning
Text Mining Tasks
• Text Classification
– Assigning a document to one of several pre-specified classes
• Text Clustering
– Unsupervised learning
• Text Summarization
– Extracting a summary for a document
– Based on syntax and semantics
• Author Identification/Determination
– Based on stylistics, syntax, and semantics
• Automatic Translation
– Based on morphology, syntax, semantics, and lexicon
• Cross Corpus Discovery
– Also known as Literature-Based Discovery
Preprocessing
• Denoising
– Means removing stopper words … words with little semantic meaning such as the, an, and, of, by, that, and so on
– Stopper words may be context dependent, e.g., Theorem and Proof in a mathematics document
• Stemming
– Means removing suffixes, prefixes, and infixes to reduce words to a root
– An example: wake, waking, awake, woke → wake
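The two preprocessing steps can be sketched as follows. The stopper-word set and suffix rules here are illustrative stand-ins, not the 313-word list or the stemmer used in the talk; a real system would use a Porter-style stemmer.

```python
# Toy stand-in for the 313-word stopper list used in the talk.
STOPPER_WORDS = {"the", "an", "and", "of", "by", "that", "a"}

def denoise(tokens):
    """Remove stopper words, i.e., words with little semantic meaning."""
    return [t for t in tokens if t.lower() not in STOPPER_WORDS]

def stem(token):
    """Crude suffix-stripping stemmer, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the comet fragments crashed into Jupiter".split()
print([stem(t) for t in denoise(tokens)])
# ['comet', 'fragment', 'crash', 'into', 'Jupiter']
```

Note that this crude stemmer maps crashed → crash but would not recover wake from woke; handling irregular forms requires lexicon-aware stemming.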
Bigrams and Trigrams
• A bigram is a word pair where the order of words is preserved.
– The first word is the reference word.
– The second is the neighbor word.
• A trigram is a word triple where order is preserved.
• Bigrams and trigrams are useful because they can capture semantic content.
Example
• Hell hath no fury like a woman scorned.
• Denoised: Hell hath no fury like woman scorned.
• Stemmed: Hell has no fury like woman scorn.
• Bigrams:
– Hell has, has no, no fury, fury like, like woman, woman scorn, scorn .
– Note that the “.” (any sentence-ending punctuation) is treated as a word
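Extracting the ordered (reference, neighbor) pairs from the slide's example sentence is a one-liner:

```python
def bigrams(tokens):
    """Return ordered (reference word, neighbor word) pairs."""
    return list(zip(tokens, tokens[1:]))

# The denoised, stemmed sentence; sentence-ending punctuation
# is treated as a word in its own right.
tokens = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
print(bigrams(tokens))
# [('hell', 'has'), ('has', 'no'), ('no', 'fury'), ('fury', 'like'),
#  ('like', 'woman'), ('woman', 'scorn'), ('scorn', '.')]
```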
Bigram Proximity Matrix
(Rows = reference word, columns = neighbor word; blank entries are 0.)

        .   fury  has  hell  like  no  scorn  woman
.
fury                          1
has                                 1
hell              1
like                                             1
no          1
scorn   1
woman                                     1
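The matrix above can be built sparsely as a dictionary of dictionaries, one increment per bigram; this sketch uses frequency counts (the slide notes binary entries are the alternative):

```python
from collections import defaultdict

def bigram_proximity_matrix(tokens):
    """Frequency-count BPM stored sparsely as {reference: {neighbor: count}}."""
    bpm = defaultdict(lambda: defaultdict(int))
    for ref, nbr in zip(tokens, tokens[1:]):
        bpm[ref][nbr] += 1
    return bpm

tokens = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
bpm = bigram_proximity_matrix(tokens)
print(bpm["fury"]["like"])  # 1
print(bpm["like"]["fury"])  # 0 -- word order is preserved
```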
Bigram Proximity Matrix
• The bigram proximity matrix (BPM) is computed for an entire document
– Entries in the matrix may be either binary or a frequency count
• The BPM is a mathematical representation of a document with some claim to capturing semantics
– Because bigrams capture noun-verb, adjective-noun, verb-adverb, and verb-subject structures
– Martinez (2002)
Vector Space Methods
• The classic structure in vector space text mining methods is a term-document matrix where
– Rows correspond to terms, columns correspond to documents, and
– Entries may be binary or frequency counts
• A simple and obvious generalization is a bigram-document matrix where
– Rows correspond to bigrams, columns to documents, and again entries are either binary or frequency counts
Example Data
• The text data were collected by the Linguistic Data Consortium in 1997 and were originally used in Martinez (2002)
– The data consisted of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995
– The full lexicon for the text database included 68,354 distinct words
• In all, 313 stopper words are removed; after denoising and stemming, there remain 45,021 words in the lexicon
– The example that I report here is based on the full set of 15,863 documents. This is the same basic data set that Dr. Wegman reported on in his keynote talk, although he considered a subset of 503 documents.
Vector Space Methods
• A document corpus we have worked with has 45,021 denoised and stemmed entries in its lexicon and 1,834,123 bigrams
– Thus the TDM is 45,021 by 15,863 and the BDM is 1,834,123 by 15,863
– The term vector is 45,021-dimensional and the bigram vector is 1,834,123-dimensional
– The BPM for each document is 1,834,123 by 1,834,123 and, of course, very sparse.
Term-Document Matrix Analysis
Zipf’s Law
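Zipf's law says that when terms are ranked by frequency, frequency falls off roughly as 1/rank, so the most frequent terms dominate the term-document matrix. A minimal rank-frequency computation on a made-up word list (a real check would plot log-frequency against log-rank for the full 45,021-term lexicon):

```python
from collections import Counter

# Toy word list, made up for illustration only.
text = "the of the and the of a the in the and of to the a".split()
ranked = Counter(text).most_common()  # [(term, freq), ...] by descending freq

for rank, (term, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq would be roughly constant.
    print(rank, term, freq)
```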
Term-Document Matrix Analysis
[Figure: two-dimensional scatterplot of stemmed lexicon terms, e.g., korea, simpson, oklahoma, bomb, comet, jupit, earthquak, clinton]
[Figure: corresponding scatterplot of document indices 1-503]
Text Example - Clusters
• A portion of the hierarchical agglomerative tree for the clusters
Text Example - Clusters
Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008
Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4%
Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam 1.5%
Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107, minist 104, govern 104, polit 104, talk 102
Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94, republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process 66, gerri.adam 59, british.govern 50
Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26
Text Example - Clusters
Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008
Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%, south.korea 1.5%
Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%, simpson 0.8%
Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204, offici 196, pyongyang 179, presid 167, talk 165
Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water 71, presid.clinton 69
Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36, warrant.offic.bobbi 35, bobbi.wayn.hall 29
Text Example - Clusters
Cluster 24, Size: 1788, ISim: 0.012, ESim: 0.007
Descriptive: school 2.2%, film 1.3%, children 1.2%, student 1.0%, percent 0.8%, compani 0.7%, kid 0.7%, peopl 0.7%, movi 0.7%, music 0.6%
Discriminating: school 2.3%, simpson 1.8%, film 1.7%, student 1.1%, presid 1.0%, serb 0.9%, children 0.8%, clinton 0.8%, movi 0.8%, music 0.8%
Phrases 1: cnn 1034, peopl 920, time 893, report 807, don 680, dai 650, look 630, call 588, live 535, lot 498
Phrases 2: littl.bit 99, lot.peopl 90, lo.angel 85, world.war 71, thank.join 67, million.dollar 60, 000.peopl 54, york.citi 50, garsten.cnn 48, san.francisco 47
Phrases 3: jeann.moo.cnn 41, cnn.entertain.new 36, cnn.jeann.moo 32, norma.quarl.cnn 30, cnn.norma.quarl 28, cnn.jeff.flock 28, jeff.flock.cnn 27, brian.cabel.cnn 26, pope.john.paul 25, lisa.price.cnn 25
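The ISim and ESim figures in these listings come from the clustering software. One plausible reading of internal similarity, sketched here on made-up term-frequency vectors, is the average pairwise cosine similarity among the documents of a cluster:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def internal_similarity(vectors):
    """Average pairwise cosine similarity within a cluster (ISim-style)."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Three made-up documents over a three-term vocabulary.
cluster = [[2, 1, 0], [1, 1, 0], [2, 0, 1]]
print(round(internal_similarity(cluster), 3))  # 0.794
```

Tight topical clusters like Cluster 0 (ISim 0.142) score much higher than the catch-all Cluster 24 (ISim 0.012), whose documents share little vocabulary.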
Bigrams
Cluster 1
Cluster Size Distribution
Document by Cluster Plot
Cluster Identities
• Cluster 02: Comet Shoemaker-Levy Crashing into Jupiter.
• Cluster 08: Oklahoma City Bombing.
• Cluster 11: Bosnian-Serb Conflict.
• Cluster 12: Court-Law, O.J. Simpson Case.
• Cluster 15: Cessna Plane Crashed onto the South Lawn of the White House.
• Cluster 19: American Army Helicopter Emergency Landing in North Korea.
• Cluster 24: Death of North Korean Leader (Kim Il Sung) and North Korea’s Nuclear Ambitions.
• Cluster 26: Shootings at Abortion Clinics in Boston.
• Cluster 28: Two Americans Detained in Iraq.
• Cluster 30: Earthquake that Hit Japan.
Bigram-Document Matrix for 50 Documents
Bigram-Bigram Matrix for 50 Documents
Bigram-Bigram Matrix Using the Top 253 Bigrams
Closing Remarks
• Text mining presents great challenges, but is amenable to statistical/mathematical approaches
– Text mining using vector space methods raises both mathematical and visualization challenges
• especially in terms of dimensionality, sparsity, and scalability.
Acknowledgments
• Dr. Angel Martinez
• Dr. Jeff Solka and Avory Bryant
• Dr. Walid Sharabati
• Funding Sources
– National Institute on Alcohol Abuse and Alcoholism (Grant Number F32AA015876)
– Army Research Office (Contract W911NF-04-1-0447)
– Army Research Laboratory (Contract W911NF-07-1-0059)
– Isaac Newton Institute
Contact Information
Yasmin H. Said
Department of Computational and Data Sciences
Email: [email protected]
Phone: 301-538-7478
The length of this document defends it well against the risk of its being read. -Winston Churchill