
Document Modeling - University of Pittsburgh (peterb/2480-012/DocumentModeling.pdf)


1/30/11


Web Document Modeling

Peter Brusilovsky

With slides from Jae-wook Ahn and Jumpol Polvichai

Introduction

•  Modeling means "the construction of an abstract representation of the document"
   –  Useful for all applications aimed at processing information automatically.

•  Why build models of documents?
   –  To guide the users to the right documents, we need to know what they are about and what their structure is
   –  Some adaptation techniques can operate with documents as "black boxes", but others are based on the ability to understand and model documents


Document modeling

[Diagram: Documents → Document Models (bag of words, tag-based, link-based, concept-based, AI-based) → Processing Application (matching/IR, filtering, adaptive presentation, etc.)]

Document model: Example

the death toll rises in the middle east as the worst violence in four years spreads beyond jerusalem. the stakes are high, the race is tight. prepping for what could be a decisive moment in the presidential battle. how a tiny town in iowa became a booming melting pot and the image that will not soon fade. the man who captured it tells the story behind it.


Outline

•  Classic IR-based representation
   –  Preprocessing
   –  Boolean, Probabilistic, Vector Space models

•  Web-IR document representation
   –  Tag-based document models
   –  Link-based document models – HITS, PageRank

•  Concept-based document modeling
   –  LSI

•  AI-based document representation
   –  ANN, Semantic Network, Bayesian Network

Markup Languages

A Markup Language is a text-based language that combines content with its metadata. Markup languages support structure modeling.

•  Presentational Markup
   –  Expresses document structure via the visual appearance of the whole text or of a particular fragment.
   –  Example: word processors

•  Procedural Markup
   –  Focuses on the presentation of text, but is usually visible to the user editing the text file, and is expected to be interpreted by software following the same procedural order in which it appears.
   –  Example: TeX, PostScript

•  Descriptive Markup
   –  Applies labels to fragments of text without necessarily mandating any particular display or other processing semantics.
   –  Example: SGML, XML


Classic IR model: Process

[Diagram: Documents → Preprocessing → Set of Terms ("Bag of Words") → Term weighting → Matching (by IR models) against a Query (or documents)]

Preprocessing: Motivation

•  Extract the document content itself to be processed (used)

•  Remove control information
   –  Tags, scripts, stylesheets, etc.

•  Remove non-informative fragments
   –  Stopwords, word stems


Preprocessing: HTML tag removal

•  Removes <.*> parts from the HTML document (source)
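A minimal sketch of the tag-removal step using a regular expression, as the slide's `<.*>` pattern suggests. This naive approach is illustrative only; real pages also need script/style blocks and HTML entities handled, and an HTML parser is more robust.

```python
import re

def strip_tags(html: str) -> str:
    """Naively remove <...> tags; illustrative sketch, not a full HTML parser."""
    # drop script and style blocks wholesale first, since their text is not content
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    # then drop any remaining tags
    text = re.sub(r"<[^>]+>", " ", html)
    # collapse the whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()

print(strip_tags("<p>General <b>Motors</b> was locked in negotiations.</p>"))
```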

Preprocessing: HTML tag removal

DETROIT — With its access to a government lifeline in the balance, General Motors was locked in intense negotiations on Monday with the United Automobile Workers over ways to cut its bills for retiree health care.


Preprocessing: Tokenizing / case normalization

•  Extract term/feature tokens from the text

detroit with its access to a government lifeline in the balance general motors was locked in intense negotiations on monday with the united automobile workers over ways to cut its bills for retiree health care

Preprocessing: Stopword removal

•  Very common words

•  Do not contribute to separating one document from another meaningfully

•  Usually a standard set of words is matched/removed

detroit access government lifeline balance general motors locked intense negotiations monday united automobile workers ways cut bills retiree health care
(stopwords such as "with", "its", "to", "a", "in", "the" removed from the tokenized example)
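The tokenization, case-normalization, and stopword-removal steps above can be sketched together. The stopword list here is a small illustrative subset, not a standard list such as the ones IR toolkits ship with.

```python
import re

# illustrative stopword subset (real systems use a standard list)
STOPWORDS = {"with", "its", "to", "a", "in", "the", "was", "on", "over", "for"}

def tokenize(text: str):
    # lowercase (case normalization) and keep alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

sentence = ("DETROIT - With its access to a government lifeline in the balance, "
            "General Motors was locked in intense negotiations on Monday.")
tokens = tokenize(sentence)
content = remove_stopwords(tokens)
print(content)
```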


Preprocessing: Stemming

•  Extracts word "stems" only
•  Avoid word variations that are not informative
   –  apples, apple
   –  retrieval, retrieve, retrieving
   –  Should they be distinguished? Maybe not.

•  Porter
•  Krovetz

Preprocessing: Stemming (Porter)

•  Martin Porter, 1979

•  Cyclical recognition and removal of known suffixes and prefixes

1/30/11

8

Preprocessing: Stemming (Krovetz)

•  Bob Krovetz, 1993
•  Makes use of inflectional linguistic morphology

•  Removes inflectional suffixes in three steps
   –  single form (e.g. '-ies', '-es', '-s')
   –  past to present tense (e.g. '-ed')
   –  removal of '-ing'

•  Checking in a dictionary
•  More human-readable
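A toy sketch of the three-step inflectional idea with a dictionary check, so only real words are produced. This is a simplified illustration of the Krovetz-style approach, not the actual Krovetz (or Porter) algorithm, and the tiny vocabulary is invented for the example.

```python
def inflectional_stem(word: str, dictionary: set) -> str:
    """Toy inflectional stemmer: strip plural, past-tense and '-ing'
    endings, checking each candidate against a dictionary."""
    if word in dictionary:
        return word
    # step 1: plural forms ('-ies', '-es', '-s')
    for suffix, repl in (("ies", "y"), ("es", ""), ("s", "")):
        if word.endswith(suffix) and word[: -len(suffix)] + repl in dictionary:
            return word[: -len(suffix)] + repl
    # step 2: past tense ('-ed')
    if word.endswith("ed") and word[:-2] in dictionary:
        return word[:-2]
    # step 3: '-ing'
    if word.endswith("ing") and word[:-3] in dictionary:
        return word[:-3]
    return word  # no known stem found; leave the word unchanged

vocab = {"apple", "retrieve", "lock", "negotiation"}
print([inflectional_stem(w, vocab) for w in ["apples", "locked", "locking", "apple"]])
```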

Preprocessing: Stemming

•  Porter stemming example
   [detroit access govern lifelin balanc gener motor lock intens negoti mondai unit automobil worker wai cut bill retire healthcare]

•  Krovetz stemming example
   [detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare]


Term weighting

•  How should we represent the terms/features after the processes so far?

detroit access government lifeline balance general motor lock intense negotiation monday united automobile worker ways cut bill retiree healthcare → ?

Term weighting: Document-term matrix

•  Columns – every term appearing in the corpus (not a single document)

•  Rows – every document in the collection

•  Example
   –  If a collection has N documents and M terms…

        T1   T2   T3   …   TM
Doc1     0    1    1   …    0
Doc2     1    0    0   …    1
…        …    …    …   …    …
DocN     0    0    0   …    1
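The matrix above can be built directly from tokenized documents. A minimal sketch, using invented token lists that match the binary table on the next slide:

```python
docs = {
    "Doc1": ["benefit", "book"],
    "Doc2": ["attract", "zoo"],
    "Doc3": ["zoo"],
}

# columns: every term in the corpus; rows: every document
terms = sorted({t for tokens in docs.values() for t in tokens})
matrix = {
    doc: [1 if t in tokens else 0 for t in terms]
    for doc, tokens in docs.items()
}
print(terms)           # ['attract', 'benefit', 'book', 'zoo']
print(matrix["Doc1"])  # [0, 1, 1, 0]
```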


Term weighting: Document-term matrix

•  Document-term matrix
   –  Binary (1 if the term appears, otherwise 0)

•  Every term is treated equivalently

        attract  benefit  book  …  zoo
Doc1       0        1       1   …   0
Doc2       1        0       0   …   1
Doc3       0        0       0   …   1

Term weighting: Term frequency

•  So, we need "weighting"
   –  Give different "importance" to different terms

•  TF – Term frequency
   –  How many times a term appeared in a document?
   –  Higher frequency → higher relatedness


Term weighting: Term frequency

[Slide figure (TF example) not captured in the extraction]

Term weighting: IDF

•  IDF – Inverse Document Frequency
   –  Generality of a term: too general → not beneficial
   –  Example
      •  "Information" (in the ACM Digital Library)
         –  99.99% of articles will have it
         –  TF will be very high in each document
      •  "Personalization"
         –  Say, 5% of documents will have it
         –  TF again will be very high


Term weighting: IDF

•  IDF = log(N / DF)
   –  DF = document frequency = number of documents that have the term
   –  If the ACM DL has 1M documents
      •  IDF("information") = log(1M / 999,900) ≈ 0.0001
      •  IDF("personalization") = log(1M / 50,000) = log 20 ≈ 3.0
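The arithmetic above can be checked directly, assuming the natural logarithm:

```python
from math import log

N = 1_000_000  # documents in the collection

def idf(df: int) -> float:
    # inverse document frequency with natural log
    return log(N / df)

print(round(idf(999_900), 4))  # "information": appears almost everywhere
print(round(idf(50_000), 2))   # "personalization": 5% of documents
```

The near-universal term gets a weight close to zero, so it contributes almost nothing even when its TF is high.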

Term weighting: IDF

•  information, personalization, recommendation
•  Can we say…
   –  Doc1, 2, 3… are about information?
   –  Doc1, 6, 8… are about personalization?
   –  Doc5 is about recommendation?

[Figure: term occurrences across Doc1 … Doc8]


Term weighting: TF*IDF

•  TF*IDF
   –  TF multiplied by IDF
   –  Considers TF and IDF at the same time
   –  High-frequency terms concentrated in a smaller portion of documents get higher scores

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176
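A minimal sketch of one common tf·idf formulation (raw TF times natural-log IDF; many variants exist, e.g. log-scaled TF or smoothed IDF). The numbers here are illustrative and do not reproduce the table above, whose underlying counts are not given in the slides.

```python
from math import log

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf: term count in the document; df: number of documents containing the term."""
    return tf * log(n_docs / df)

# a term that is frequent but appears in every document scores 0,
# while one concentrated in few documents scores high
print(tf_idf(10, 4, 4))            # 0.0
print(round(tf_idf(10, 1, 4), 2))  # 13.86
```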

Term weighting: BM25

•  OKAPI BM25
   –  Okapi Best Match 25
   –  Probabilistic model – calculates term relevance within a document
   –  Computes a term weight according to the probability of its appearance in a relevant document and the probability of it appearing in a non-relevant document in a collection D
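The slides do not give the formula, so as an illustration here is one common BM25 formulation (with the usual free parameters k1 and b); actual systems differ in the exact IDF variant used.

```python
from math import log

def bm25_term(tf: float, df: int, n_docs: int, doc_len: float,
              avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Per-query-term BM25 score: IDF times a saturating, length-normalized TF."""
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # length normalization: longer-than-average documents are penalized
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# term frequency helps, but with diminishing returns (saturation)
s1 = bm25_term(tf=1, df=5, n_docs=1000, doc_len=100, avg_doc_len=100)
s5 = bm25_term(tf=5, df=5, n_docs=1000, doc_len=100, avg_doc_len=100)
print(s1 < s5)  # True
```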


Term weighting: Entropy weighting

•  Entropy weighting
   –  Entropy of term ti
      •  -1: equal distribution over all documents
      •  0: appearing in only 1 document

IR models

•  Boolean
•  Probabilistic
•  Vector Space


IR Models: Boolean model

•  Based on set theory and Boolean algebra

•  d1, d3

IR Models: Boolean model

•  Simple and easy to implement
•  Shortcomings
   –  Only retrieves exact matches
      •  No partial match
   –  No ranking
   –  Depends on user query formulation
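Boolean retrieval over postings reduces to set algebra. A minimal sketch with hypothetical postings (the slide's actual example query was not captured in the text):

```python
# postings: which documents contain each term
postings = {
    "benefit":  {"d1", "d3"},
    "springer": {"d1", "d2"},
    "book":     {"d1", "d3"},
}

# Boolean query: benefit AND book  -> set intersection
print(sorted(postings["benefit"] & postings["book"]))      # ['d1', 'd3']
# Boolean query: benefit AND NOT springer -> set difference
print(sorted(postings["benefit"] - postings["springer"]))  # ['d3']
```

Note there is no score: a document either satisfies the Boolean expression or it does not, which is exactly the "no partial match, no ranking" shortcoming above.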


IR Models: Probabilistic model

•  Binary weight vector
•  Query-document similarity function

•  Probability that a certain document is relevant to a certain query

•  Ranking – according to the probability of being relevant

IR Models: Probabilistic model

•  Similarity calculation
   –  Bayes' theorem and removing some constants

•  Simplifying assumptions
   –  No relevance information at startup


IR Models: Probabilistic model

•  Shortcomings
   –  Division of the set of documents into relevant / non-relevant documents
   –  Term independence assumption
   –  Index terms – binary weights

IR Models: Vector space model

•  Document = m-dimensional space (m = number of index terms)

•  Each term represents a dimension

•  Component of a document vector along a given direction → term importance

•  Query and documents are represented as vectors


IR Models: Vector space model

•  Document similarity
   –  Cosine angle

•  Benefits
   –  Term weighting
   –  Partial matching

•  Shortcomings
   –  Term independency

[Figure: cosine angle between document vectors d1 and d2 in term space (t1, t2)]

IR Models: Vector space model

•  Example
   –  Query = "springer book"
   –  q = (0, 0, 0, 1, 1)
   –  Sim(d1, q) = (0.176 + 0.176) / (√1 + √(0.176² + 0.176² + 0.417² + 0.176² + 0.176²)) = 0.228
   –  Sim(d2, q) = 0.528 / (√1 + √(0.350² + 0.528²)) = 0.323
   –  Sim(d3, q) = 0.176 / (√1 + √(0.528² + 0.176²)) = 0.113

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176


IR Models: Vector space model

•  Document-document similarity
•  Sim(d1, d2) = 0.447

•  Sim(d2, d3) = 0.0

•  Sim(d1, d3) = 0.408

Document   benef   attract   sav     springer   book
d1         0.176   0.176     0.417   0.176      0.176
d2         0.000   0.350     0.000   0.528      0.000
d3         0.528   0.000     0.000   0.000      0.176
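Standard cosine similarity over the slide's tf·idf vectors reproduces the document-document values above:

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# tf*idf vectors from the slide (terms: benef, attract, sav, springer, book)
d1 = [0.176, 0.176, 0.417, 0.176, 0.176]
d2 = [0.000, 0.350, 0.000, 0.528, 0.000]
d3 = [0.528, 0.000, 0.000, 0.000, 0.176]

print(round(cosine(d1, d2), 3))  # 0.447
print(round(cosine(d2, d3), 3))  # 0.0  (no terms in common)
print(round(cosine(d1, d3), 3))  # 0.408
```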

Curse of dimensionality

•  TDT4
   –  |D| = 96,260
   –  |ITD| = 118,205

•  If sim(q, D) is calculated linearly
   –  96,260 (one per document) × 118,205-element inner-product comparisons

•  However, document matrices are very sparse
   –  Mostly 0's
   –  Storing those 0's is space- and computation-inefficient


Curse of dimensionality

•  Inverted index
   –  Index from term to document
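An inverted index maps each term to the set of documents containing it, so a query only touches documents that share at least one term. A minimal sketch with hypothetical documents:

```python
from collections import defaultdict

docs = {
    "d1": ["benefit", "saving", "book"],
    "d2": ["attract", "springer"],
}

# inverted index: term -> set of documents containing it
index = defaultdict(set)
for doc_id, tokens in docs.items():
    for term in tokens:
        index[term].add(doc_id)

print(sorted(index["book"]))  # ['d1']
```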

Web-IR document representation

•  Enhances the classic VSM
•  Possibilities offered by the HTML language

•  Tag-based
•  Link-based
   –  HITS
   –  PageRank


Web-IR: Tag-based approaches

•  Give different weights to different tags
   –  Some text fragments within a tag may be more important than others
   –  <body>, <title>, <h1>, <h2>, <h3>, <a> …

Web-IR: Tag-based approaches

•  WEBOR system
•  Six classes of tags

•  CIV = class importance vector
•  TFV = class frequency vector (the term's frequency in each tag class)


Web-IR: Tag-based approaches

•  Term weighting example
   –  CIV = {0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5}
   –  TFV("personalization") = {0, 3, 3, 0, 0, 8, 10}
   –  W("personalization") = (0.0 + 3.0 + 2.4 + 0.0 + 0.0 + 6.4 + 5.0) × IDF
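The tag-class weight is just the dot product of the two vectors above (before the IDF factor):

```python
# class importance vector and per-class term frequencies from the slide
CIV = [0.6, 1.0, 0.8, 0.5, 0.7, 0.8, 0.5]
TFV = [0, 3, 3, 0, 0, 8, 10]  # "personalization"

# element-wise product summed: 0.0 + 3.0 + 2.4 + 0.0 + 0.0 + 6.4 + 5.0
weight = sum(c * f for c, f in zip(CIV, TFV))
print(round(weight, 1))  # 16.8 (then multiplied by IDF)
```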

Web-IR: HITS (Hyperlink-Induced Topic Search)

•  Link-based approach
•  Promotes search performance by considering Web document links

•  Works on an initial set of retrieved documents
•  Hubs and authorities
   –  A good authority page is one that is pointed to by many good hub pages
   –  A good hub page is one that points to many good authority pages
   –  Circular definition → iterative computation


Web-IR: HITS (Hyperlink-Induced Topic Search)

•  Iterative update of authority & hub vectors

Web-IR: HITS (Hyperlink-Induced Topic Search)

[Figure: worked authority/hub update on the example graph]

1/30/11

24

Web-IR: HITS (Hyperlink-Induced Topic Search)

•  N = 1
   –  A = [0.371  0.557  0.743]
   –  H = [0.667  0.667  0.333]

•  N = 10
   –  A = [0.344  0.573  0.744]
   –  H = [0.722  0.619  0.309]

•  N = 1000
   –  A = [0.328  0.591  0.737]
   –  H = [0.737  0.591  0.328]

[Figure: example link graph over Doc0, Doc1, Doc2]
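The iteration can be sketched as repeated mutual updates with L2 normalization. The slide's graph figure was lost in extraction, so the link structure below is an assumption, chosen to be consistent with the converged values shown above:

```python
from math import sqrt

# assumed 3-document link graph (reconstructed, consistent with the slide's
# converged scores): Doc0 <-> Doc1, Doc0 -> Doc2, Doc1 -> Doc2, Doc2 -> Doc1
links = {0: [1, 2], 1: [0, 2], 2: [1]}
n = 3

hub = [1.0] * n
auth = [1.0] * n
for _ in range(1000):
    # authority: sum of hub scores of pages pointing to it
    auth = [sum(hub[i] for i in links if j in links[i]) for j in range(n)]
    # hub: sum of authority scores of the pages it points to
    hub = [sum(auth[j] for j in links[i]) for i in range(n)]
    # L2-normalize both vectors
    na = sqrt(sum(a * a for a in auth)); auth = [a / na for a in auth]
    nh = sqrt(sum(h * h for h in hub)); hub = [h / nh for h in hub]

print([round(a, 3) for a in auth])  # [0.328, 0.591, 0.737]
print([round(h, 3) for h in hub])   # [0.737, 0.591, 0.328]
```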

Web-IR: PageRank

•  Google
•  Unlike HITS
   –  Not limited to a specific initial retrieved set of documents
   –  Single value

•  Initial state = no links
•  Evenly divide scores over the 4 documents
•  PR(A) = PR(B) = PR(C) = PR(D) = 1/4 = 0.25

[Figure: four unlinked pages A, B, C, D]


Web-IR: PageRank

•  PR(A) = PR(B) + PR(C) + PR(D) = 0.25 + 0.25 + 0.25 = 0.75

[Figure: B, C, and D each link to A]

Web-IR: PageRank

•  PR(B) to A = 0.25 / 2 = 0.125
•  PR(B) to C = 0.25 / 2 = 0.125
•  PR(A) = PR(B)/2 + PR(C) + PR(D) = 0.125 + 0.25 + 0.25 = 0.625

[Figure: B links to A and C; C and D link to A]


Web-IR: PageRank

•  PR(D) to A = 0.25 / 3 = 0.083
•  PR(D) to B = 0.25 / 3 = 0.083
•  PR(D) to C = 0.25 / 3 = 0.083
•  PR(A) = PR(B)/2 + PR(C) + PR(D)/3 = 0.125 + 0.25 + 0.083 = 0.458

•  Recursively keep calculating for further documents linking to A, B, C, and D

[Figure: B links to A and C; C links to A; D links to A, B, and C]
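The recursive calculation above converges when iterated; the full PageRank also adds a damping factor d, which the slides' simplified walkthrough omits. A sketch under the slide's final link configuration, with an assumed back-link A → B added so that no page is a dead end:

```python
def pagerank(links, d=0.85, iterations=100):
    """Standard damped PageRank by power iteration (the slides show the
    simpler, undamped propagation step)."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new = {}
        for p in links:
            # each page q linking to p contributes PR(q) / outdegree(q)
            incoming = sum(pr[q] / len(links[q]) for q in links if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# slide's last configuration: B -> A, C ; C -> A ; D -> A, B, C
# (A -> B is an assumption, so every page has an out-link)
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = pagerank(links)
print(round(sum(pr.values()), 3))  # 1.0 (scores form a distribution)
```

A, with the most in-links, ends up with the highest score; D, with none, the lowest.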

Concept-based document modeling: LSI (Latent Semantic Indexing)

•  Represents documents by concepts
   –  Not by terms

•  Reduce term space → concept space
   –  Linear algebra technique: SVD (Singular Value Decomposition)

•  Step (1): Matrix decomposition – the original document matrix A is factored into three matrices


Concept-based document modeling: LSI

•  Step (2): A rank k is selected from the original equation (k = reduced size of the concept space)

•  Step (3): The original term-document matrix A is converted to Ak
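The three steps can be sketched with NumPy's SVD on a toy term-document matrix (the values are illustrative; the slides' actual matrix was lost in extraction):

```python
import numpy as np

# toy term-document matrix A (rows: terms, columns: documents)
A = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

# step 1: factor A into three matrices, A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# step 2: keep only the k largest singular values (k = concept-space size)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# step 3: Ak is the closest rank-k approximation of A; documents are now
# compared through the k concept dimensions (columns of Vt[:k, :])
print(Ak.shape)  # same shape as A, but rank k
```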

Concept-based document modeling: LSI

•  Document-term matrix A


Concept-based document modeling: LSI

•  Decomposition

Concept-based document modeling: LSI

•  Low-rank approximation (k = 2)


Concept-based document modeling: LSI

•  Final Ak
   –  Columns: documents
   –  Rows: concepts (k = 2)

[Plot: documents d1, d2, d3 as columns of V^T in the 2-dimensional SVD concept space]

AI-based approaches: Artificial Neural Networks

•  Query, term, and document nodes separated into 3 layers

•  Term-document weight = normalized TF-IDF

•  Query term activation: if the sum of the signals exceeds a threshold → document retrieval


AI-based approaches: Semantic Networks

•  Conceptual knowledge

•  Relationship between concepts

AI-based approaches: Bayesian Networks

•  Metzler and Croft (2004)
   –  Indri search engine based on InQuery

•  Inference network
   –  Document
   –  Representation (terms, phrases)
   –  Query
   –  Information Need

•  Calculates the probability of each document from the network