
Document Modeling - University of Pittsburgh (peterb/2480-012/DocumentModeling.pdf)


1/30/11


Web Document Modeling

Peter Brusilovsky

With slides from Jae-wook Ahn and Jumpol Polvichai

Introduction

•  Modeling means "the construction of an abstract representation of the document"
   –  Useful for all applications aimed at processing information automatically.

•  Why build models of documents?
   –  To guide the users to the right documents, we need to know what they are about and what their structure is
   –  Some adaptation techniques can operate with documents as "black boxes", but others are based on the ability to understand and model documents


Document modeling

[Diagram: Documents → Document Models (bag of words, tag-based, link-based, concept-based, AI-based) → Processing Application (matching/IR, filtering, adaptive presentation, etc.)]

Document model: Example

the death toll rises in the middle east as the worst violence in four years spreads beyond jerusalem. the stakes are high, the race is tight. prepping for what could be a decisive moment in the presidential battle. how a tiny town in iowa became a booming melting pot and the image that will not soon fade. the man who captured it tells the story behind it.


Outline

•  Classic IR-based representation
   –  Preprocessing
   –  Boolean, Probabilistic, Vector Space models

•  Web-IR document representation
   –  Tag-based document models
   –  Link-based document models – HITS, PageRank

•  Concept-based document modeling
   –  LSI

•  AI-based document representation
   –  ANN, Semantic Network, Bayesian Network

Markup Languages

A Markup Language is a text-based language that combines content with its metadata. Markup languages support structure modeling.

•  Presentational Markup
   –  Expresses document structure via the visual appearance of the whole text or of a particular fragment.
   –  Example: word processors

•  Procedural Markup
   –  Focuses on the presentation of text, but is usually visible to the user editing the text file, and is expected to be interpreted by software following the same procedural order in which it appears.
   –  Example: TeX, PostScript

•  Descriptive Markup
   –  Applies labels to fragments of text without necessarily mandating any particular display or other processing semantics.
   –  Example: SGML, XML


Classic IR model: Process

[Diagram: Documents → Preprocessing → Set of Terms ("Bag of Words") → Term weighting → Matching (by IR models) against a Query (or documents)]

Preprocessing: Motivation

•  Extract the document content itself to be processed (used)

•  Remove control information
   –  Tags, scripts, stylesheets, etc.

•  Remove non-informative fragments
   –  Stopwords, word stems


Preprocessing: HTML tag removal

•  Removes <.*> parts from the HTML document (source)
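A minimal sketch of the tag-removal step using a regular expression, as the slide's `<.*>` pattern suggests. This naive approach is illustrative only; real pages also need script/style blocks and HTML entities handled, and an HTML parser is more robust.

```python
import re

def strip_tags(html: str) -> str:
    """Naively remove <...> tags; illustrative sketch, not a full HTML parser."""
    # drop script and style blocks wholesale first, since their text is not content
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    # then drop any remaining tags
    text = re.sub(r"<[^>]+>", " ", html)
    # collapse the whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()

print(strip_tags("<p>General <b>Motors</b> was locked in negotiations.</p>"))
```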

Preprocessing: HTML tag removal

DETROIT — With its access to a government lifeline in the balance, General Motors was locked in intense negotiations on Monday with the United Automobile Workers over ways to cut its bills for retiree health care.


Preprocessing: Tokenizing / case normalization

•  Extract term/feature tokens from the text

detroit with its access to a government lifeline in the balance general motors was locked in intense negotiations on monday with the united automobile workers over ways to cut its bills for retiree health care

Preprocessing: Stopword removal

•  Very common words

•  Do not contribute to separating one document from another meaningfully

•  Usually a standard set of words is matched/removed

detroit access government lifeline balance general motors locked intense negotiations monday united automobile workers ways cut bills retiree health care
(stopwords such as "with", "its", "to", "a", "in", "the" removed from the tokenized example)
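The tokenization, case-normalization, and stopword-removal steps above can be sketched together. The stopword list here is a small illustrative subset, not a standard list such as the ones IR toolkits ship with.

```python
import re

# illustrative stopword subset (real systems use a standard list)
STOPWORDS = {"with", "its", "to", "a", "in", "the", "was", "on", "over", "for"}

def tokenize(text: str):
    # lowercase (case normalization) and keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

sentence = ("DETROIT - With its access to a government lifeline in the balance, "
            "General Motors was locked in intense negotiations on Monday.")
tokens = tokenize(sentence)
content = remove_stopwords(tokens)
print(content)
```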


Preprocessing: Stemming

•  Extracts word "stems" only
•  Avoid word variations that are not informative
   –  apples, apple
   –  retrieval, retrieve, retrieving
   –  Should they be distinguished? Maybe not.

•  Porter
•  Krovetz

Preprocessing: Stemming (Porter)

•  Martin Porter, 1979

•  Cyclical recognition and removal of known suffixes and prefixes

1/30/11

8

Preprocessing: Stemming (Krovetz)

•  Bob Krovetz, 1993
•  Makes use of inflectional linguistic morphology

•  Removes inflectional suffixes in three steps
   –  single form (e.g. '-ies', '-es', '-s')
   –  past to present tense (e.g. '-ed')
   –  removal of '-ing'

•  Checking in a dictionary
•  More human-readable
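A toy sketch of the three-step inflectional idea with a dictionary check, so only real words are produced. This is a simplified illustration of the Krovetz-style approach, not the actual Krovetz (or Porter) algorithm, and the tiny vocabulary is invented for the example.

```python
def inflectional_stem(word: str, dictionary: set) -> str:
    """Toy inflectional stemmer: strip plural, past-tense and '-ing'
    endings, checking each candidate against a dictionary."""
    if word in dictionary:
        return word
    # step 1: plural forms ('-ies', '-es', '-s')
    for suffix, repl in (("ies", "y"), ("es", ""), ("s", "")):
        if word.endswith(suffix) and word[: -len(suffix)] + repl in dictionary:
            return word[: -len(suffix)] + repl
    # step 2: past tense ('-ed')
    if word.endswith("ed") and word[:-2] in dictionary:
        return word[:-2]
    # step 3: '-ing'
    if word.endswith("ing") and word[:-3] in dictionary:
        return word[:-3]
    return word  # no known stem found; leave the word unchanged

vocab = {"apple", "retrieve", "lock", "negotiation"}
print([inflectional_stem(w, vocab) for w in ["apples", "locked", "locking", "apple"]])
```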

Preprocessing: Stemming

•  Porter stemming example
   [detroit access govern lifelin balanc gener motor lock intens negoti mondai unit automobil worker wai cut bill retire healthcare]

•  Krovetz stemming example
   [detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare]


Term weighting

•  How should we represent the terms/features after the processes so far?

detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare → ?

Term weighting: Document-term matrix

•  Columns – every term appearing in the corpus (not a single document)

•  Rows – every document in the collection

•  Example
   –  If a collection has N documents and M terms…

        T1   T2   T3   …   TM
Doc1     0    1    1   …    0
Doc2     1    0    0   …    1
…        …    …    …   …    …
DocN     0    0    0   …    1
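The matrix above can be built directly from tokenized documents. A minimal sketch, using invented token lists that match the binary table on the next slide:

```python
docs = {
    "Doc1": ["benefit", "book"],
    "Doc2": ["attract", "zoo"],
    "Doc3": ["zoo"],
}

# columns: every term in the corpus; rows: every document
terms = sorted({t for tokens in docs.values() for t in tokens})
matrix = {
    doc: [1 if t in tokens else 0 for t in terms]
    for doc, tokens in docs.items()
}
print(terms)           # ['attract', 'benefit', 'book', 'zoo']
print(matrix["Doc1"])  # [0, 1, 1, 0]
```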


Term weighting: Document-term matrix

•  Document-term matrix
   –  Binary (1 if the term appears, otherwise 0)

•  Every term is treated equivalently

        attract  benefit  book  …  zoo
Doc1       0        1       1   …   0
Doc2       1        0       0   …   1
Doc3       0        0       0   …   1

Term weighting: Term frequency

•  So, we need "weighting"
   –  Give different "importance" to different terms

•  TF – Term frequency
   –  How many times a term appeared in a document?
   –  Higher frequency → higher relatedness


Term weighting: Term frequency

[Slide figure (TF example) not captured in the extraction]

Term weighting: IDF

•  IDF – Inverse Document Frequency
   –  Generality of a term: too general → not beneficial
   –  Example
      •  "Information" (in the ACM Digital Library)
         –  99.99% of articles will have it
         –  TF will be very high in each document
      •  "Personalization"
         –  Say, 5% of documents will have it
         –  TF again will be very high


Term weighting: IDF

•  IDF = log(N / DF)
   –  DF = document frequency = number of documents that have the term
   –  If the ACM DL has 1M documents
      •  IDF("information") = log(1M / 999,900) ≈ 0.0001
      •  IDF("personalization") = log(1M / 50,000) = log 20 ≈ 3.0
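The arithmetic above can be checked directly, assuming the natural logarithm:

```python
from math import log

N = 1_000_000  # documents in the collection

def idf(df: int) -> float:
    # inverse document frequency with natural log
    return log(N / df)

print(round(idf(999_900), 4))  # "information": appears almost everywhere
print(round(idf(50_000), 2))   # "personalization": 5% of documents
```

The near-universal term gets a weight close to zero, so it contributes almost nothing even when its TF is high.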

Term weighting: IDF

•  information, personalization, recommendation
•  Can we say…
   –  Doc1, 2, 3… are about information?
   –  Doc1, 6, 8… are about personalization?
   –  Doc5 is about recommendation?

[Figure: term occurrences across Doc1 … Doc8]


Term weighting: TF*IDF

•  TF*IDF
   –  TF multiplied by IDF
   –  Considers TF and IDF at the same time
   –  High-frequency terms concentrated in a smaller portion of documents get higher scores

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176
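A minimal sketch of one common tf·idf formulation (raw TF times natural-log IDF; many variants exist, e.g. log-scaled TF or smoothed IDF). The numbers here are illustrative and do not reproduce the table above, whose underlying counts are not given in the slides.

```python
from math import log

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf: term count in the document; df: number of documents containing the term."""
    return tf * log(n_docs / df)

# a term that is frequent but appears in every document scores 0,
# while one concentrated in few documents scores high
print(tf_idf(10, 4, 4))            # 0.0
print(round(tf_idf(10, 1, 4), 2))  # 13.86
```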

Term weighting: BM25

•  OKAPI BM25
   –  Okapi Best Match 25
   –  Probabilistic model – calculates term relevance within a document
   –  Computes a term weight according to the probability of its appearance in a relevant document and the probability of it appearing in a non-relevant document in a collection D
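The slides do not give the formula, so as an illustration here is one common BM25 formulation (with the usual free parameters k1 and b); actual systems differ in the exact IDF variant used.

```python
from math import log

def bm25_term(tf: float, df: int, n_docs: int, doc_len: float,
              avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Per-query-term BM25 score: IDF times a saturating, length-normalized TF."""
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # length normalization: longer-than-average documents are penalized
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# term frequency helps, but with diminishing returns (saturation)
s1 = bm25_term(tf=1, df=5, n_docs=1000, doc_len=100, avg_doc_len=100)
s5 = bm25_term(tf=5, df=5, n_docs=1000, doc_len=100, avg_doc_len=100)
print(s1 < s5)  # True
```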


Term weighting: Entropy weighting

•  Entropy weighting
   –  Entropy of term ti
      •  -1: equal distribution over all documents
      •  0: appearing in only 1 document

IR models

•  Boolean
•  Probabilistic
•  Vector Space


IR Models: Boolean model

•  Based on set theory and Boolean algebra

•  d1, d3

IR Models: Boolean model

•  Simple and easy to implement
•  Shortcomings
   –  Only retrieves exact matches
      •  No partial match
   –  No ranking
   –  Depends on user query formulation
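Boolean retrieval over postings reduces to set algebra. A minimal sketch with hypothetical postings (the slide's actual example query was not captured in the text):

```python
# postings: which documents contain each term
postings = {
    "benefit":  {"d1", "d3"},
    "springer": {"d1", "d2"},
    "book":     {"d1", "d3"},
}

# Boolean query: benefit AND book  -> set intersection
print(sorted(postings["benefit"] & postings["book"]))      # ['d1', 'd3']
# Boolean query: benefit AND NOT springer -> set difference
print(sorted(postings["benefit"] - postings["springer"]))  # ['d3']
```

Note there is no score: a document either satisfies the Boolean expression or it does not, which is exactly the "no partial match, no ranking" shortcoming above.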


IR Models: Probabilistic model

•  Binary weight vector
•  Query-document similarity function

•  Probability that a certain document is relevant to a certain query

•  Ranking – according to the probability of being relevant

IR Models: Probabilistic model

•  Similarity calculation
   –  Bayes' theorem and removing some constants

•  Simplifying assumptions
   –  No relevance information at startup


IR Models: Probabilistic model

•  Shortcomings
   –  Division of the set of documents into relevant / non-relevant documents
   –  Term independence assumption
   –  Index terms – binary weights

IR Models: Vector space model

•  Document = m-dimensional space (m = number of index terms)

•  Each term represents a dimension

•  Component of a document vector along a given direction → term importance

•  Query and documents are represented as vectors


IR Models: Vector space model

•  Document similarity
   –  Cosine angle

•  Benefits
   –  Term weighting
   –  Partial matching

•  Shortcomings
   –  Term independency

[Figure: cosine angle between document vectors d1 and d2 in term space (t1, t2)]

IR Models: Vector space model

•  Example
   –  Query = "springer book"
   –  q = (0, 0, 0, 1, 1)
   –  Sim(d1, q) = (0.176 + 0.176) / (√1 + √(0.176² + 0.176² + 0.417² + 0.176² + 0.176²)) = 0.228
   –  Sim(d2, q) = 0.528 / (√1 + √(0.350² + 0.528²)) = 0.323
   –  Sim(d3, q) = 0.176 / (√1 + √(0.528² + 0.176²)) = 0.113

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176


IR Models: Vector space model

•  Document-document similarity
•  Sim(d1, d2) = 0.447

•  Sim(d2, d3) = 0.0

•  Sim(d1, d3) = 0.408

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176
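Standard cosine similarity over the slide's tf·idf vectors reproduces the document-document values above:

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# tf*idf vectors from the slide (terms: benef, attract, sav, springer, book)
d1 = [0.176, 0.176, 0.417, 0.176, 0.176]
d2 = [0.000, 0.350, 0.000, 0.528, 0.000]
d3 = [0.528, 0.000, 0.000, 0.000, 0.176]

print(round(cosine(d1, d2), 3))  # 0.447
print(round(cosine(d2, d3), 3))  # 0.0  (no terms in common)
print(round(cosine(d1, d3), 3))  # 0.408
```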

Curse of dimensionality

•  TDT4
   –  |D| = 96,260
   –  |ITD| = 118,205

•  If sim(q, D) is calculated linearly
   –  96,260 (one per document) × 118,205-element inner-product comparisons

•  However, document matrices are very sparse
   –  Mostly 0's
   –  Storing those 0's is space- and computation-inefficient


Curse of dimensionality

•  Inverted index
   –  Index from term to document
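An inverted index maps each term to the set of documents containing it, so a query only touches documents that share at least one term. A minimal sketch with hypothetical documents:

```python
from collections import defaultdict

docs = {
    "d1": ["benefit", "saving", "book"],
    "d2": ["attract", "springer"],
}

# inverted index: term -> set of documents containing it
index = defaultdict(set)
for doc_id, tokens in docs.items():
    for term in tokens:
        index[term].add(doc_id)

print(sorted(index["book"]))  # ['d1']
```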

Web-IR document representation

•  Enhances the classic VSM
•  Possibilities offered by the HTML language

•  Tag-based
•  Link-based
   –  HITS
   –  PageRank


Web-IR: Tag-based approaches

•  Give different weights to different tags
   –  Some text fragments within a tag may be more important than others
   –  <body>, <title>, <h1>, <h2>, <h3>, <a> …

Web-IR: Tag-based approaches

•  WEBOR system
•  Six classes of tags

•  CIV = class importance vector
•  TFV = class frequency vector (the term's frequency in each tag class)


Web-IR: Tag-based approaches

•  Term weighting example
   –  CIV = {0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5}
   –  TFV("personalization") = {0, 3, 3, 0, 0, 8, 10}
   –  W("personalization") = (0.0 + 3.0 + 2.4 + 0.0 + 0.0 + 6.4 + 5.0) × IDF
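The tag-class weight is just the dot product of the two vectors above (before the IDF factor):

```python
# class importance vector and per-class term frequencies from the slide
CIV = [0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5]
TFV = [0, 3, 3, 0, 0, 8, 10]  # "personalization"

# element-wise product summed: 0.0 + 3.0 + 2.4 + 0.0 + 0.0 + 6.4 + 5.0
weight = sum(c * f for c, f in zip(CIV, TFV))
print(round(weight, 1))  # 16.8 (then multiplied by IDF)
```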

Web-IR: HITS (Hyperlink-Induced Topic Search)

•  Link-based approach
•  Promotes search performance by considering Web document links

•  Works on an initial set of retrieved documents
•  Hubs and authorities
   –  A good authority page is one that is pointed to by many good hub pages
   –  A good hub page is one that points to many good authority pages
   –  Circular definition → iterative computation


Web-IR: HITS (Hyperlink-Induced Topic Search)

•  Iterative update of authority & hub vectors

Web-IR: HITS (Hyperlink-Induced Topic Search)

[Figure: worked authority/hub update on the example graph]

1/30/11

24

Web-IR: HITS (Hyperlink-Induced Topic Search)

•  N = 1
   –  A = [0.371  0.557  0.743]
   –  H = [0.667  0.667  0.333]

•  N = 10
   –  A = [0.344  0.573  0.744]
   –  H = [0.722  0.619  0.309]

•  N = 1000
   –  A = [0.328  0.591  0.737]
   –  H = [0.737  0.591  0.328]

[Figure: example link graph over Doc0, Doc1, Doc2]
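The iteration can be sketched as repeated mutual updates with L2 normalization. The slide's graph figure was lost in extraction, so the link structure below is an assumption, chosen to be consistent with the converged values shown above:

```python
from math import sqrt

# assumed 3-document link graph (reconstructed, consistent with the slide's
# converged scores): Doc0 <-> Doc1, Doc0 -> Doc2, Doc1 -> Doc2, Doc2 -> Doc1
links = {0: [1, 2], 1: [0, 2], 2: [1]}
n = 3

hub = [1.0] * n
auth = [1.0] * n
for _ in range(1000):
    # authority: sum of hub scores of pages pointing to it
    auth = [sum(hub[i] for i in links if j in links[i]) for j in range(n)]
    # hub: sum of authority scores of the pages it points to
    hub = [sum(auth[j] for j in links[i]) for i in range(n)]
    # L2-normalize both vectors
    na = sqrt(sum(a * a for a in auth)); auth = [a / na for a in auth]
    nh = sqrt(sum(h * h for h in hub)); hub = [h / nh for h in hub]

print([round(a, 3) for a in auth])  # [0.328, 0.591, 0.737]
print([round(h, 3) for h in hub])   # [0.737, 0.591, 0.328]
```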

Web-IR: PageRank

•  Google
•  Unlike HITS
   –  Not limited to a specific initial retrieved set of documents
   –  Single value

•  Initial state = no links
•  Evenly divide scores over the 4 documents
•  PR(A) = PR(B) = PR(C) = PR(D) = 1/4 = 0.25

[Figure: four unlinked pages A, B, C, D]


Web-IR: PageRank

•  PR(A) = PR(B) + PR(C) + PR(D) = 0.25 + 0.25 + 0.25 = 0.75

[Figure: B, C, and D each link to A]

Web-IR: PageRank

•  PR(B) to A = 0.25 / 2 = 0.125
•  PR(B) to C = 0.25 / 2 = 0.125
•  PR(A) = PR(B)/2 + PR(C) + PR(D) = 0.125 + 0.25 + 0.25 = 0.625

[Figure: B links to A and C; C and D link to A]


Web-IR: PageRank

•  PR(D) to A = 0.25 / 3 = 0.083
•  PR(D) to B = 0.25 / 3 = 0.083
•  PR(D) to C = 0.25 / 3 = 0.083
•  PR(A) = PR(B)/2 + PR(C) + PR(D)/3 = 0.125 + 0.25 + 0.083 = 0.458

•  Recursively keep calculating for further documents linking to A, B, C, and D

[Figure: B links to A and C; C links to A; D links to A, B, and C]
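The recursive calculation above converges when iterated; the full PageRank also adds a damping factor d, which the slides' simplified walkthrough omits. A sketch under the slide's final link configuration, with an assumed back-link A → B added so that no page is a dead end:

```python
def pagerank(links, d=0.85, iterations=100):
    """Standard damped PageRank by power iteration (the slides show the
    simpler, undamped propagation step)."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new = {}
        for p in links:
            # each page q linking to p contributes PR(q) / outdegree(q)
            incoming = sum(pr[q] / len(links[q]) for q in links if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# slide's last configuration: B -> A, C ; C -> A ; D -> A, B, C
# (A -> B is an assumption, so every page has an out-link)
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = pagerank(links)
print(round(sum(pr.values()), 3))  # 1.0 (scores form a distribution)
```

A, with the most in-links, ends up with the highest score; D, with none, the lowest.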

Concept-based document modeling: LSI (Latent Semantic Indexing)

•  Represents documents by concepts
   –  Not by terms

•  Reduce term space → concept space
   –  Linear algebra technique: SVD (Singular Value Decomposition)

•  Step (1): Matrix decomposition – the original document matrix A is factored into three matrices


Concept-based document modeling: LSI

•  Step (2): A rank k is selected from the original equation (k = reduced size of the concept space)

•  Step (3): The original term-document matrix A is converted to Ak
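The three steps can be sketched with NumPy's SVD on a toy term-document matrix (the values are illustrative; the slides' actual matrix was lost in extraction):

```python
import numpy as np

# toy term-document matrix A (rows: terms, columns: documents)
A = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

# step 1: factor A into three matrices, A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# step 2: keep only the k largest singular values (k = concept-space size)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# step 3: Ak is the closest rank-k approximation of A; documents are now
# compared through the k concept dimensions (columns of Vt[:k, :])
print(Ak.shape)  # same shape as A, but rank k
```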

Concept-based document modeling: LSI

•  Document-term matrix A


Concept-based document modeling: LSI

•  Decomposition

Concept-based document modeling: LSI

•  Low-rank approximation (k = 2)


Concept-based document modeling: LSI

•  Final Ak
   –  Columns: documents
   –  Rows: concepts (k = 2)

[Plot: documents d1, d2, d3 as columns of V^T in the 2-dimensional SVD concept space]

AI-based approaches: Artificial Neural Networks

•  Query, term, and document nodes separated into 3 layers

•  Term-document weight = normalized TF-IDF

•  Query term activation: if the sum of the signals exceeds a threshold → document retrieval


AI-based approaches: Semantic Networks

•  Conceptual knowledge

•  Relationship between concepts

AI-based approaches: Bayesian Networks

•  Metzler and Croft (2004)
   –  Indri search engine based on InQuery

•  Inference network
   –  Document
   –  Representation (terms, phrases)
   –  Query
   –  Information Need

•  Calculates the probability of each document from the network