Preserving Semantic Content in Text Mining Using Multigrams
Yasmin H. Said
Department of Computational and Data Sciences, George Mason University
QMDNS 2010 - May 26, 2010
This is joint work with Edward J. Wegman
Outline
• Background on Text Mining
• Bigrams
– Term-Document and Bigram-Document Matrices
– Term-Term and Document-Document Associations
• Example Using 15,863 Documents
To read between the lines is easier than to follow the text. -Henry James
Text Data Mining
• Synthesis of …
– Information Retrieval
• Focuses on retrieving documents from a fixed database
• May be multimedia including text, images, video, audio
– Natural Language Processing
• Usually addresses more challenging questions
• Bag-of-words methods
• Vector space models
– Statistical Data Mining
• Pattern recognition, classification, clustering
Natural Language Processing
• Key elements are:
– Morphology (grammar of word forms)
– Syntax (grammar of word combinations to form sentences)
– Semantics (meaning of a word or sentence)
– Lexicon (vocabulary or set of words)
• Time flies like an arrow
– Time passes speedily like an arrow passes speedily, or
– Measure the speed of a fly like you would measure the speed of an arrow
• Ambiguity of nouns and verbs
• Ambiguity of meaning
Text Mining Tasks
• Text Classification
– Assigning a document to one of several pre-specified classes
• Text Clustering
– Unsupervised learning
• Text Summarization
– Extracting a summary for a document
– Based on syntax and semantics
• Author Identification/Determination
– Based on stylistics, syntax, and semantics
• Automatic Translation
– Based on morphology, syntax, semantics, and lexicon
• Cross Corpus Discovery
– Also known as Literature-Based Discovery
Preprocessing
• Denoising
– Means removing stopper words … words with little semantic meaning such as the, an, and, of, by, that, and so on
– Stopper words may be context dependent, e.g., Theorem and Proof in a mathematics document
• Stemming
– Means removing suffixes, prefixes, and infixes to reduce words to a root
– An example: wake, waking, awake, woke → wake
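The two preprocessing steps can be sketched as follows. The stopper-word set and suffix rules here are illustrative stand-ins, not the 313-word list or the stemmer used in the talk; a real system would use a Porter-style stemmer.

```python
# Toy stand-in for the 313-word stopper list used in the talk.
STOPPER_WORDS = {"the", "an", "and", "of", "by", "that", "a"}

def denoise(tokens):
    """Remove stopper words, i.e., words with little semantic meaning."""
    return [t for t in tokens if t.lower() not in STOPPER_WORDS]

def stem(token):
    """Crude suffix-stripping stemmer, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the comet fragments crashed into Jupiter".split()
print([stem(t) for t in denoise(tokens)])
# ['comet', 'fragment', 'crash', 'into', 'Jupiter']
```

Note that this crude stemmer maps crashed → crash but would not recover wake from woke; handling irregular forms requires lexicon-aware stemming.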
Bigrams and Trigrams
• A bigram is a word pair where the order of words is preserved.
– The first word is the reference word.
– The second is the neighbor word.
• A trigram is a word triple where order is preserved.
• Bigrams and trigrams are useful because they can capture semantic content.
Example
• Hell hath no fury like a woman scorned.
• Denoised: Hell hath no fury like woman scorned.
• Stemmed: Hell has no fury like woman scorn.
• Bigrams:
– Hell has, has no, no fury, fury like, like woman, woman scorn, scorn .
– Note that the “.” (any sentence-ending punctuation) is treated as a word
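Extracting the ordered (reference, neighbor) pairs from the slide's example sentence is a one-liner:

```python
def bigrams(tokens):
    """Return ordered (reference word, neighbor word) pairs."""
    return list(zip(tokens, tokens[1:]))

# The denoised, stemmed sentence; sentence-ending punctuation
# is treated as a word in its own right.
tokens = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
print(bigrams(tokens))
# [('hell', 'has'), ('has', 'no'), ('no', 'fury'), ('fury', 'like'),
#  ('like', 'woman'), ('woman', 'scorn'), ('scorn', '.')]
```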
Bigram Proximity Matrix
(Rows = reference word, columns = neighbor word; blank entries are 0.)

        .   fury  has  hell  like  no  scorn  woman
.
fury                          1
has                                 1
hell              1
like                                             1
no          1
scorn   1
woman                                     1
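The matrix above can be built sparsely as a dictionary of dictionaries, one increment per bigram; this sketch uses frequency counts (the slide notes binary entries are the alternative):

```python
from collections import defaultdict

def bigram_proximity_matrix(tokens):
    """Frequency-count BPM stored sparsely as {reference: {neighbor: count}}."""
    bpm = defaultdict(lambda: defaultdict(int))
    for ref, nbr in zip(tokens, tokens[1:]):
        bpm[ref][nbr] += 1
    return bpm

tokens = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
bpm = bigram_proximity_matrix(tokens)
print(bpm["fury"]["like"])  # 1
print(bpm["like"]["fury"])  # 0 -- word order is preserved
```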
Bigram Proximity Matrix
• The bigram proximity matrix (BPM) is computed for an entire document
– Entries in the matrix may be either binary or a frequency count
• The BPM is a mathematical representation of a document with some claim to capturing semantics
– Because bigrams capture noun-verb, adjective-noun, verb-adverb, and verb-subject structures
– Martinez (2002)
Vector Space Methods
• The classic structure in vector space text mining methods is a term-document matrix where
– Rows correspond to terms, columns correspond to documents, and
– Entries may be binary or frequency counts
• A simple and obvious generalization is a bigram-document matrix where
– Rows correspond to bigrams, columns to documents, and again entries are either binary or frequency counts
Example Data
• The text data were collected by the Linguistic Data Consortium in 1997 and were originally used in Martinez (2002)
– The data consisted of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995
– The full lexicon for the text database included 68,354 distinct words
• In all, 313 stopper words are removed; after denoising and stemming, there remain 45,021 words in the lexicon
– The example that I report here is based on the full set of 15,863 documents. This is the same basic data set that Dr. Wegman reported on in his keynote talk, although he considered a subset of 503 documents.
Vector Space Methods
• A document corpus we have worked with has 45,021 denoised and stemmed entries in its lexicon and 1,834,123 bigrams
– Thus the TDM is 45,021 by 15,863 and the BDM is 1,834,123 by 15,863
– The term vector is 45,021-dimensional and the bigram vector is 1,834,123-dimensional
– The BPM for each document is 1,834,123 by 1,834,123 and, of course, very sparse.
Term-Document Matrix Analysis
Zipf’s Law
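Zipf's law says that when terms are ranked by frequency, frequency falls off roughly as 1/rank, so the most frequent terms dominate the term-document matrix. A minimal rank-frequency computation on a made-up word list (a real check would plot log-frequency against log-rank for the full 45,021-term lexicon):

```python
from collections import Counter

# Toy word list, made up for illustration only.
text = "the of the and the of a the in the and of to the a".split()
ranked = Counter(text).most_common()  # [(term, freq), ...] by descending freq

for rank, (term, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq would be roughly constant.
    print(rank, term, freq)
```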
Term-Document Matrix Analysis
[Figure: two-dimensional scatterplot of stemmed lexicon terms, e.g., korea, simpson, oklahoma, bomb, comet, jupit, earthquak, clinton]
[Figure: corresponding scatterplot of document indices 1-503]
Text Example - Clusters
• A portion of the hierarchical agglomerative tree for the clusters
Text Example - Clusters
Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008
Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4%
Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam 1.5%
Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107, minist 104, govern 104, polit 104, talk 102
Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94, republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process 66, gerri.adam 59, british.govern 50
Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26
Text Example - Clusters
Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008
Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%, south.korea 1.5%
Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%, simpson 0.8%
Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204, offici 196, pyongyang 179, presid 167, talk 165
Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water 71, presid.clinton 69
Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36, warrant.offic.bobbi 35, bobbi.wayn.hall 29
Text Example - Clusters
Cluster 24, Size: 1788, ISim: 0.012, ESim: 0.007
Descriptive: school 2.2%, film 1.3%, children 1.2%, student 1.0%, percent 0.8%, compani 0.7%, kid 0.7%, peopl 0.7%, movi 0.7%, music 0.6%
Discriminating: school 2.3%, simpson 1.8%, film 1.7%, student 1.1%, presid 1.0%, serb 0.9%, children 0.8%, clinton 0.8%, movi 0.8%, music 0.8%
Phrases 1: cnn 1034, peopl 920, time 893, report 807, don 680, dai 650, look 630, call 588, live 535, lot 498
Phrases 2: littl.bit 99, lot.peopl 90, lo.angel 85, world.war 71, thank.join 67, million.dollar 60, 000.peopl 54, york.citi 50, garsten.cnn 48, san.francisco 47
Phrases 3: jeann.moo.cnn 41, cnn.entertain.new 36, cnn.jeann.moo 32, norma.quarl.cnn 30, cnn.norma.quarl 28, cnn.jeff.flock 28, jeff.flock.cnn 27, brian.cabel.cnn 26, pope.john.paul 25, lisa.price.cnn 25
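The ISim and ESim figures in these listings come from the clustering software. One plausible reading of internal similarity, sketched here on made-up term-frequency vectors, is the average pairwise cosine similarity among the documents of a cluster:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def internal_similarity(vectors):
    """Average pairwise cosine similarity within a cluster (ISim-style)."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Three made-up documents over a three-term vocabulary.
cluster = [[2, 1, 0], [1, 1, 0], [2, 0, 1]]
print(round(internal_similarity(cluster), 3))  # 0.794
```

Tight topical clusters like Cluster 0 (ISim 0.142) score much higher than the catch-all Cluster 24 (ISim 0.012), whose documents share little vocabulary.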
Bigrams
Cluster 1
Cluster Size Distribution
Document by Cluster Plot
Cluster Identities
• Cluster 02: Comet Shoemaker-Levy Crashing into Jupiter.
• Cluster 08: Oklahoma City Bombing.
• Cluster 11: Bosnian-Serb Conflict.
• Cluster 12: Court-Law, O.J. Simpson Case.
• Cluster 15: Cessna Plane Crashed onto the South Lawn of the White House.
• Cluster 19: American Army Helicopter Emergency Landing in North Korea.
• Cluster 24: Death of North Korean Leader (Kim Il Sung) and North Korea’s Nuclear Ambitions.
• Cluster 26: Shootings at Abortion Clinics in Boston.
• Cluster 28: Two Americans Detained in Iraq.
• Cluster 30: Earthquake that Hit Japan.
Bigram-Document Matrix for 50 Documents
Bigram-Bigram Matrix for 50 Documents
Bigram-Bigram Matrix Using the Top 253 Bigrams
Closing Remarks
• Text mining presents great challenges, but is amenable to statistical/mathematical approaches
– Text mining using vector space methods raises both mathematical and visualization challenges
• especially in terms of dimensionality, sparsity, and scalability.
Acknowledgments
• Dr. Angel Martinez
• Dr. Jeff Solka and Avory Bryant
• Dr. Walid Sharabati
• Funding Sources
– National Institute on Alcohol Abuse and Alcoholism (Grant Number F32AA015876)
– Army Research Office (Contract W911NF-04-1-0447)
– Army Research Laboratory (Contract W911NF-07-1-0059)
– Isaac Newton Institute
Contact Information
Yasmin H. Said
Department of Computational and Data Sciences
Email: [email protected]
Phone: 301-538-7478
The length of this document defends it well against the risk of its being read. -Winston Churchill