Information Retrieval
Dr. Qaiser Abbas, Department of Computer Science & IT, University of Sargodha, Sargodha, 40100, Pakistan
[email protected]
Saturday, 27 February 16


Processing Boolean queries

•  How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query: Brutus AND Calpurnia over the inverted index partially shown in Figure 1.3. We:
–  1. Locate Brutus in the Dictionary
–  2. Retrieve its postings
–  3. Locate Calpurnia in the Dictionary
–  4. Retrieve its postings
–  5. Intersect the two postings lists, as shown in Figure 1.5.
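The five steps can be sketched in a few lines of Python. This is only an illustration: the `index` dictionary, its docIDs, and the helper names `postings` and `boolean_and` are made up here and are not taken from Figure 1.3. Step 5 uses a set intersection as a shortcut rather than the pointer-based merge.

```python
# Toy inverted index: term -> postings list of docIDs, sorted ascending.
# The docIDs are invented for illustration only.
index = {
    "brutus": [1, 3, 7, 12, 30],
    "calpurnia": [3, 30, 41],
}

def postings(term):
    """Steps 1-4: locate the term in the dictionary and retrieve its postings."""
    return index.get(term.lower(), [])

def boolean_and(term1, term2):
    """Step 5: intersect the two postings lists (set-based shortcut)."""
    common = set(postings(term1)) & set(postings(term2))
    return sorted(common)

result = boolean_and("Brutus", "Calpurnia")
print(result)
```

A term that is missing from the dictionary simply yields an empty postings list, so the conjunction is empty as well.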


•  The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms.
–  (This operation is sometimes referred to as merging postings lists. This slightly counterintuitive name (contrary to intuition or to common-sense expectation) reflects using the term merge algorithm for a general family of algorithms; here we are merging the lists with a logical AND operation.)

•  There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6).


•  We maintain pointers into both lists and walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

•  At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID.

•  If the lengths of the postings lists are x and y, the intersection takes O(x+y) operations.

•  To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.
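The merge-style intersection described in these bullets can be sketched as follows. This is a minimal version that assumes postings lists are plain Python lists of integer docIDs sorted in ascending order; the name `intersect` and the sample lists are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i, j = 0, 0  # one pointer into each list
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that points to the smaller docID.
            i += 1
        else:
            j += 1
    return answer

# Invented sample postings lists.
brutus = [1, 3, 7, 12, 30]
calpurnia = [3, 30, 41]
print(intersect(brutus, calpurnia))
```

Each loop iteration advances at least one pointer, which is why the number of steps is bounded by the sum of the two list lengths.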


•  Exercise 1.4 (page 12): For the queries below, can we still run through the intersection in time O(x+y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
–  a. Brutus AND NOT Caesar
–  b. Brutus OR NOT Caesar

Solution a. Page 10 of the book gives the cost of intersecting two postings lists as O(x+y), where x and y are the lengths of the lists to be intersected. For the condition Brutus AND NOT Caesar, consider the following postings lists.

Case 1: the postings list for Brutus has fewer postings than that for Caesar:
–  Brutus: 1 3 10 21
–  Caesar: 1 6 9 23 45 57

We have to find the set of documents that contain Brutus and do not contain Caesar. We use the following logic:


Position pointer p1 at the first posting in the postings list for Brutus and pointer p2 at the first posting in the postings list for Caesar. Compare the docIDs pointed to by each pointer, DocID(p1) and DocID(p2):

1.  If DocID(p1) = DocID(p2), the document contains both Brutus and Caesar. We do not want this, so move both pointers to the next posting and repeat the compare operation.

2.  If DocID(p1) < DocID(p2), the document DocID(p1) contains Brutus and not Caesar. This is what we want, so store DocID(p1) in the answer array, move p1 to the next posting, and repeat the compare operation.

3.  If DocID(p1) > DocID(p2), move p2 to the next posting and repeat the compare operation.

We run this compare-and-increment loop until p1 points to NULL, and then we stop. If p2 reaches NULL first, every remaining posting in the Brutus list goes into the answer.

We do not need to run the operation until p2 points to NULL, because we only require docIDs that contain Brutus; e.g. we do not have to consider postings 45 and 57 in the list for Caesar.

Answer = 3, 10, 21
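The compare-and-increment loop for Brutus AND NOT Caesar can be sketched in Python as below. The function name `and_not` is illustrative, and the lists are assumed to be sorted by docID. The sketch also covers the situation where the excluded list runs out first, in which case all remaining docIDs of the first list are kept.

```python
def and_not(p1, p2):
    """Return docIDs that appear in p1 but not in p2 (both sorted by docID)."""
    answer = []
    i, j = 0, 0
    while i < len(p1):  # stop as soon as the first list is exhausted
        if j >= len(p2) or p1[i] < p2[j]:
            # p2 is exhausted, or p2 has already moved past this docID:
            # the document contains the first term only.
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            # Document contains both terms: skip it.
            i += 1
            j += 1
        else:
            # p2 is behind: advance it.
            j += 1
    return answer

# Postings lists from Case 1 of the solution.
brutus = [1, 3, 10, 21]
caesar = [1, 6, 9, 23, 45, 57]
print(and_not(brutus, caesar))  # [3, 10, 21]
```

The loop never looks at postings 45 and 57, which is the y1 <= y effect discussed in the solution.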


Thus the complexity of querying is O(x+y1), where x is the length of the postings list for the term that has to be in the expression and y1 is the number of postings traversed in the list for the excluded term by the time p1 reaches NULL. Since y1 <= y, O(x+y1) is bounded by O(x+y).

Case 2: the postings list for Brutus is longer than that for Caesar:
–  Brutus: 1 5 11 21 45 55
–  Caesar: 1 11 170

Even in this case, the entire postings list for Brutus has to be traversed (x), but only part of the postings list for Caesar (y1) has to be traversed before p1 reaches NULL. Thus the complexity of querying is O(x+y1), where y1 <= y.


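Solution b. For Brutus OR NOT Caesar, we need the documents that contain Brutus (whether or not they contain Caesar) together with the documents that do not contain Caesar (whether or not they contain Brutus).
–  Brutus: 1 5 11 21 45 55
–  Caesar: 1 11 170
–  Other: 2 10 11 33 34

Here the entire length (x) of the postings list for Brutus has to be traversed to find the docIDs that contain Brutus, and the entire length (z) of the postings list for Other has to be traversed to find the remaining docIDs that do not contain Caesar. Similarly, the length (y1) of the postings list for Caesar has to be traversed until the Brutus and Other lists both reach NULL. Thus the complexity of querying is O(x+y1+z), where y1 <= y. Because z covers documents outside both original lists, the cost is no longer bounded by O(x+y) and can grow with the size of the collection.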
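A small Python sketch of Brutus OR NOT Caesar makes the extra cost visible. It assumes, purely for illustration, that the collection consists of exactly the docIDs in the three lists above; the names `or_not` and `all_docs` are hypothetical. The loop has to visit every docID in the collection, not just those in the two query lists.

```python
def or_not(p1, p2, all_docs):
    """Return docIDs from all_docs that are in p1 or are not in p2.

    All three lists are assumed to be sorted by docID.
    """
    answer = []
    i, j = 0, 0
    for doc in all_docs:
        # Move each pointer forward until it reaches or passes doc.
        while i < len(p1) and p1[i] < doc:
            i += 1
        while j < len(p2) and p2[j] < doc:
            j += 1
        in_p1 = i < len(p1) and p1[i] == doc
        in_p2 = j < len(p2) and p2[j] == doc
        if in_p1 or not in_p2:
            answer.append(doc)
    return answer

brutus = [1, 5, 11, 21, 45, 55]
caesar = [1, 11, 170]
other = [2, 10, 11, 33, 34]
# Assumption: these three lists cover the whole collection.
all_docs = sorted(set(brutus) | set(caesar) | set(other))
result = or_not(brutus, caesar, all_docs)
print(result)
```

Only document 170 is dropped, because it contains Caesar but not Brutus.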


•  We can extend the intersection operation to process more complicated queries like:
–  (Brutus OR Caesar) AND NOT Calpurnia

•  Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
–  A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing?


•  Consider a query that is an AND of t terms, for instance:
–  Brutus AND Caesar AND Calpurnia

•  For each of the t terms, we need to get its postings, then AND them together. The standard heuristic is to process terms in order of increasing document frequency:
–  if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3, we execute the above query as:
–  (Calpurnia AND Brutus) AND Caesar

•  This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list.
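The ordering heuristic can be sketched as follows. The toy `index` and its docIDs are invented for illustration (they are not the lists of Figure 1.3), and the length of each postings list stands in for the document frequency that a real dictionary would store separately in memory.

```python
def intersect(p1, p2):
    """Merge-style intersection of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(index, terms):
    """AND together the postings of all terms, rarest term first."""
    # Order terms by increasing document frequency (list length here).
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index[term])
    return ordered, result

# Invented toy index.
index = {
    "brutus": [1, 3, 7, 12, 30, 44],
    "caesar": [1, 3, 5, 7, 9, 12, 30, 50],
    "calpurnia": [3, 30, 41],
}
order, docs = intersect_many(index, ["brutus", "caesar", "calpurnia"])
print(order, docs)
```

With these toy lists the processing order comes out as Calpurnia, Brutus, Caesar, and every intermediate result is no longer than the Calpurnia list.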


•  Consider now the optimization of more general queries, such as:
–  (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)

•  As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term.
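The size estimate for each OR group can be sketched like this. The document frequencies in `df` are hypothetical numbers chosen only to make the example run, and the function name `order_conjuncts` is illustrative.

```python
def order_conjuncts(query, df):
    """Sort OR-groups by the (conservative) sum of their term frequencies.

    query: list of OR-groups, each a list of terms
    df:    mapping from term to document frequency
    """
    def estimate(group):
        # Upper bound on the size of the OR: no document is counted as shared.
        return sum(df[term] for term in group)
    return sorted(query, key=estimate)

# Hypothetical document frequencies, for illustration only.
df = {
    "madding": 10, "crowd": 500,
    "ignoble": 20, "strife": 100,
    "killed": 900, "slain": 50,
}
query = [["madding", "crowd"], ["ignoble", "strife"], ["killed", "slain"]]
plan = order_conjuncts(query, df)
print(plan)
```

The sum is an upper bound because documents containing both disjuncts are counted twice, which is why the estimate is called conservative.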


•  Exercise 1.7 [⋆]
–  Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
–  given the following postings list sizes:


•  Exercise 1.7 [⋆]
–  (kaleidoscope OR eyes) (300,321) AND (tangerine OR trees) (363,465) AND (marmalade OR skies) (379,571)
–  However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer than (marmalade OR skies), because the two components of the former are more asymmetric.


Assignment No. 2


The extended Boolean model versus ranked retrieval

•  The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query.

•  A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, so Boolean search systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators.

•  A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
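A word-distance proximity check can be sketched as below. The sketch assumes that each term's occurrences in a document are available as a sorted list of word positions, and it treats "close" as a position difference of at most k; both the helper names and that exact definition of closeness are assumptions made for illustration.

```python
def positions(tokens, term):
    """Word positions (0-based) at which term occurs in a tokenized document."""
    return [pos for pos, tok in enumerate(tokens) if tok == term]

def within_k(positions1, positions2, k):
    """True if some occurrence of term 1 lies within k words of term 2.

    Both position lists must be sorted in ascending order.
    """
    i, j = 0, 0
    while i < len(positions1) and j < len(positions2):
        if abs(positions1[i] - positions2[j]) <= k:
            return True
        # Advance the pointer at the smaller position to close the gap.
        if positions1[i] < positions2[j]:
            i += 1
        else:
            j += 1
    return False

tokens = "gates founded microsoft with paul allen".split()
gates = positions(tokens, "gates")          # [0]
microsoft = positions(tokens, "microsoft")  # [2]
print(within_k(gates, microsoft, 2))  # True
print(within_k(gates, microsoft, 1))  # False
```

A sentence- or paragraph-level operator would need structural boundaries in the index in addition to word positions.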


•  Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw:
–  Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
   Query: "trade secret" /s disclos! /s prevent /s employe!
–  Information need: Requirements for disabled people to be able to access a workplace.
   Query: disab! /p access! /s work-site work-place (employment /3 place)
–  Information need: Cases about a host’s responsibility for drunk guests.
   Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest


•  Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is AND, and /s, /p, and /k ask for matches in the same sentence, same paragraph, or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally, work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.
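The trailing wildcard can be illustrated with a simple prefix match over a vocabulary. This is only a sketch of the idea, not how Westlaw implements it; the function name `expand_trailing_wildcard` and the small vocabulary are invented for the example.

```python
def expand_trailing_wildcard(pattern, vocabulary):
    """Expand a query term such as 'liab!' to all vocabulary words it matches."""
    if pattern.endswith("!"):
        prefix = pattern[:-1]
        # Trailing wildcard: every word that starts with the prefix.
        return sorted(w for w in vocabulary if w.startswith(prefix))
    # No wildcard: the term matches only itself, if it is in the vocabulary.
    return [pattern] if pattern in vocabulary else []

# Invented vocabulary.
vocabulary = {"liable", "liability", "liberty", "label", "guest"}
matches = expand_trailing_wildcard("liab!", vocabulary)
print(matches)  # ['liability', 'liable']
```

The postings lists of the matched words would then be combined with OR before the rest of the query is evaluated.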


•  Here we just mention a few of the main additional things we would like to be able to do:
–  We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.
–  It is often useful to search for compounds or phrases that denote a concept such as “operating system”. As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.
–  A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents in which a term occurs more frequently. To be able to do this we need term frequency information in postings lists.
–  Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query.


•  Exercise 1.12 [⋆]: Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain.
