Information Retrieval

Dr. Qaiser Abbas
Department of Computer Science & IT,
University of Sargodha, Sargodha, 40100, Pakistan
qaiser.abbas@uos.edu.pk

Saturday, 27 February 16

Processing Boolean queries

•  How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query: Brutus AND Calpurnia over the inverted index partially shown in Figure 1.3. We:
–  1. Locate Brutus in the Dictionary
–  2. Retrieve its postings
–  3. Locate Calpurnia in the Dictionary
–  4. Retrieve its postings
–  5. Intersect the two postings lists, as shown in Figure 1.5.
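The five steps can be traced on a toy index. The sketch below assumes the dictionary is a Python dict mapping each term to a docID-sorted postings list. The postings values and the name `boolean_and` are illustrative assumptions, not read from the figures, and step 5 uses a plain membership test as a stand-in for the intersection procedure.

```python
# Toy inverted index: term -> postings list sorted by docID.
# The docIDs are illustrative assumptions, not values taken from the slides.
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def boolean_and(index, term1, term2):
    """Answer `term1 AND term2` by following the five steps listed above."""
    postings1 = index.get(term1, [])  # steps 1-2: locate term1, retrieve its postings
    postings2 = index.get(term2, [])  # steps 3-4: locate term2, retrieve its postings
    members = set(postings2)
    # Step 5: keep the docIDs present in both lists (order of postings1 is kept).
    return [doc_id for doc_id in postings1 if doc_id in members]


result = boolean_and(index, "Brutus", "Calpurnia")
print(result)
```

With these assumed postings the query returns docIDs 2 and 31. A term missing from the dictionary yields an empty result.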


•  The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms.
–  (This operation is sometimes referred to as merging postings lists: this slightly counterintuitive (contrary to intuition or to common-sense expectation) name reflects using the term merge algorithm for a general family of algorithms; here we are merging the lists with a logical AND operation.)

•  There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6).


•  We maintain pointers into both lists and walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

•  At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID.

•  If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations.

•  To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.
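The pointer walk described in these bullets can be sketched as follows. This is a minimal version that assumes each postings list is a Python list of integer docIDs already sorted in increasing order; the function name `intersect` and the sample postings are illustrative.

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists with a two-pointer merge."""
    answer = []
    i, j = 0, 0
    # Each iteration advances at least one pointer, so the loop runs at most
    # len(p1) + len(p2) times: O(x + y).
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]   # assumed postings
calpurnia = [2, 31, 54, 101]               # assumed postings
both = intersect(brutus, calpurnia)
print(both)
```

If either list were not sorted by the same ordering, the "advance the smaller docID" rule could skip over a match, which is why the global ordering matters.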


•  Exercise 1.4 (page 12) For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
–  a. Brutus AND NOT Caesar
–  b. Brutus OR NOT Caesar

Solution a. Page 10 of the book gives the complexity of querying, O(N), as O(x + y), where x and y are the lengths of the postings lists to be intersected. For the given condition Brutus AND NOT Caesar, consider the following postings lists.

Case 1 - when the postings list for Brutus has fewer postings than that for Caesar:
–  Brutus 1 3 10 21
–  Caesar 1 6 9 23 45 57

We have to find the set of documents that have Brutus and do not have Caesar. We use the following logic.


Position pointer p1 at the first posting in the postings list for Brutus and pointer p2 at the first posting in the postings list for Caesar. Compare the docIDs pointed to by each pointer (compare DocID(p1) and DocID(p2)):

1.  If DocID(p1) = DocID(p2), the document contains both Brutus and Caesar. We do not want this, so move to the next posting in both lists and compare again.

2.  If DocID(p1) < DocID(p2), then DocID(p1) contains Brutus AND NOT Caesar. This is what we want, so store DocID(p1) in the answer array, move the pointer for Brutus to the next posting, and compare again.

3.  If DocID(p1) > DocID(p2), move the pointer for Caesar to the next posting and compare again.

We run the compare-and-increment loop until p1 points to NULL and then stop. If p2 reaches NULL first, every remaining posting for Brutus goes into the answer.

We do not need to run the operation until p2 points to NULL, because we only require the docIDs that contain Brutus; e.g. we do not have to consider postings 45 and 57 in the list for Caesar.

Answer = 3, 10, 21
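The three cases above translate directly into a loop. The sketch below assumes postings are docID-sorted Python lists; `and_not` is an illustrative name, and the test `j == len(p2)` plays the role of p2 pointing to NULL.

```python
def and_not(p1, p2):
    """Return the docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i, j = 0, 0
    while i < len(p1):                    # stop as soon as p1 is exhausted
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])          # case 2: has term 1, lacks term 2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                        # case 1: has both terms, skip it
            j += 1
        else:
            j += 1                        # case 3: advance the excluded list
    return answer


brutus = [1, 3, 10, 21]
caesar = [1, 6, 9, 23, 45, 57]
answer = and_not(brutus, caesar)
print(answer)
```

On the Case 1 lists this gives 3, 10, 21, and the loop ends without ever looking at postings 45 and 57.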


Thus, the complexity of querying is O(x + y1), where x is the length of the postings list for the term that has to be in the expression and y1 is the number of postings traversed in the list for the term to be excluded by the time p1 reaches NULL. In this case, O(x + y1) <= O(x + y), where y1 <= y.

Case 2 - when the postings list for Brutus is longer than that for Caesar:
–  Brutus 1 5 11 21 45 55
–  Caesar 1 11 170

Even in this case, the entire postings list for Brutus has to be traversed (x), but only part of the postings list for Caesar has to be traversed (y1), until p1 reaches NULL. Thus the complexity of querying is O(x + y1), where y1 <= y.


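Solution b. For Brutus OR NOT Caesar, we need to find documents that either contain Brutus (whether or not they contain Caesar) or do not contain Caesar (whether or not they contain Brutus).
–  Brutus 1 5 11 21 45 55
–  Caesar 1 11 170
–  Other 2 10 11 33 34

Here the entire length (x) of the postings list for Brutus has to be traversed to find docIDs that contain Brutus. The entire length (z) of the postings list for Other has to be traversed as well. Similarly, the length (y1) of the postings list for Caesar has to be traversed until the pointers for Brutus and Other both reach NULL. Thus the complexity of querying is O(x + y1 + z), where y1 <= y. Because z is not bounded by x or y, this is not O(x + y) in general.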
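One way to check the answer for part b is to evaluate the query over an explicit list of every docID in the collection. The sketch below assumes, purely for illustration, that the collection consists of exactly the docIDs appearing in the three lists above; `or_not`, `and_not` and `all_doc_ids` are hypothetical names.

```python
def and_not(p1, p2):
    """Return the docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i, j = 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer


def or_not(p1, p2, all_doc_ids):
    """Return the docIDs that contain term 1 or lack term 2."""
    # A document fails `t1 OR NOT t2` only when it has t2 but not t1.
    excluded = and_not(p2, p1)
    # Walking all_doc_ids is what makes the cost grow with the collection size.
    return and_not(all_doc_ids, excluded)


brutus = [1, 5, 11, 21, 45, 55]
caesar = [1, 11, 170]
other = [2, 10, 11, 33, 34]
# Assumption: these three lists cover every document in the collection.
all_doc_ids = sorted(set(brutus) | set(caesar) | set(other))
result = or_not(brutus, caesar, all_doc_ids)
print(result)
```

Only document 170 (Caesar without Brutus) is dropped. The second pass walks the whole collection, which illustrates why the cost depends on more than x and y.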


•  We can extend the intersection operation to process more complicated queries like:
–  (Brutus OR Caesar) AND NOT Calpurnia

•  Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
–  A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing?
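As a rough illustration of the extension, the query (Brutus OR Caesar) AND NOT Calpurnia can be evaluated by composing a merge-style union with a merge-style difference. The postings values and the function names `union` and `and_not` are assumptions made for this sketch.

```python
def union(p1, p2):
    """Merge two docID-sorted postings lists, keeping each docID once (OR)."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    answer.extend(p1[i:])  # at most one of these two tails is non-empty
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """Return the docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i, j = 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]   # assumed postings
caesar = [1, 2, 4, 5, 6, 16, 57, 132]      # assumed postings
calpurnia = [2, 31, 54, 101]               # assumed postings
result = and_not(union(brutus, caesar), calpurnia)
print(result)
```

Both helpers return sorted lists, so their outputs can be fed to further merge steps without re-sorting.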


•  Consider a query that is an AND of t terms, for instance:
–  Brutus AND Caesar AND Calpurnia

•  For each of the t terms, we need to get its postings, then AND them together. The standard heuristic is to process terms in order of increasing document frequency:
–  if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3, we execute the above query as:
–  (Calpurnia AND Brutus) AND Caesar

•  This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list.
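The ordering heuristic can be sketched as below. This assumes the document frequency of a term is simply the length of its postings list and that the index is an in-memory dict. The postings values and the name `intersect_terms` are illustrative only.

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists with a two-pointer merge."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(index, terms):
    """AND together a non-empty list of terms, rarest term first."""
    # Document frequency is taken to be the postings list length.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index.get(term, []))
    return result, ordered


index = {  # assumed postings, sorted by docID
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132, 148],
    "Calpurnia": [2, 31, 54, 101],
}
result, order = intersect_terms(index, ["Brutus", "Caesar", "Calpurnia"])
print(order, result)
```

With these assumed lists the processing order is Calpurnia, Brutus, Caesar, so no intermediate result is ever longer than the four postings of Calpurnia.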


•  Consider now the optimization of more general queries, such as:
–  (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)

•  As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term.


•  Exercise 1.7 [⋆]
–  Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
–  given the following postings list sizes:


•  Exercise 1.7 [⋆]
–  (kaleidoscope OR eyes) (300,321) AND (tangerine OR trees) (363,465) AND (marmalade OR skies) (379,571)
–  However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer than (marmalade OR skies), because the two components of the former are more asymmetric.
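The estimate and the caveat can both be checked with a few lines of arithmetic. The table of postings sizes does not survive in these slides, so the individual sizes below are an assumption recalled from the textbook exercise; their pairwise sums do reproduce the three totals quoted above. The helper name `or_size_bounds` is illustrative.

```python
# Postings list sizes: an assumption recalled from the textbook exercise,
# since the table is missing here. The sums match the totals quoted above.
sizes = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

query = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]


def or_size_bounds(terms, sizes):
    """Return (lower, upper) bounds on the size of an OR of the given terms."""
    lower = max(sizes[t] for t in terms)  # reached if the smaller lists are contained in the largest
    upper = sum(sizes[t] for t in terms)  # reached if the lists are disjoint
    return lower, upper


# Order the AND by the conservative (upper-bound) estimate of each disjunction.
order = sorted(query, key=lambda terms: or_size_bounds(terms, sizes)[1])
for terms in order:
    print(" OR ".join(terms), or_size_bounds(terms, sizes))
```

The lower bounds show the caveat: (tangerine OR trees) has at least 316,812 postings, while (marmalade OR skies) could have as few as 271,658.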

Assignment No. 2

The extended Boolean model versus ranked retrieval

•  The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query.

•  A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, so Boolean search systems have implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators.

•  A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.


•  Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw:
–  Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
   Query: "trade secret" /s disclos! /s prevent /s employe!
–  Information need: Requirements for disabled people to be able to access a workplace.
   Query: disab! /p access! /s work-site work-place (employment /3 place)
–  Information need: Cases about a host’s responsibility for drunk guests.
   Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest


•  Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is AND and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.
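To make two of these operators concrete, here is a toy sketch of a trailing wildcard and a within-k-words test. It is not Westlaw's implementation. It assumes that "within k words" means the word positions differ by at most k, and the names `prefix_matches` and `within_k`, the vocabulary and the sample text are all hypothetical.

```python
def prefix_matches(pattern, vocabulary):
    """Expand a trailing-wildcard pattern such as 'liab!' against a vocabulary."""
    stem = pattern.rstrip("!")  # the pattern is assumed to end with '!'
    return sorted(word for word in vocabulary if word.startswith(stem))


def within_k(positions_a, positions_b, k):
    """True if some occurrence of term A is at most k positions from term B."""
    return any(abs(a - b) <= k for a in positions_a for b in positions_b)


vocabulary = {"liable", "liability", "liberty", "guest"}
expanded = prefix_matches("liab!", vocabulary)

# Record the word positions of each term in one (hypothetical) document.
text = "employment at the place of business".split()
positions = {}
for pos, word in enumerate(text):
    positions.setdefault(word, []).append(pos)

near = within_k(positions["employment"], positions["place"], 3)
print(expanded, near)
```

Answering such queries from an index rather than from raw text requires postings that store word positions, not just docIDs.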


•  Here we just mention a few of the main additional things we would like to be able to do:
–  We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.

–  It is often useful to search for compounds or phrases that denote a concept such as “operating system”. As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.

–  A Boolean model only records term presence or absence, but often we would like to accumulate evidence of how frequently a term occurs. To be able to do this we need term frequency information in postings lists.

–  Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query.
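The last two points can be illustrated with postings that carry term frequencies and a deliberately naive score. The sketch below sums raw term frequencies, which is only a placeholder for a real scoring mechanism. The documents, the function name `rank` and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical three-document collection.
docs = {
    1: "operating system design",
    2: "system call system trap",
    3: "design patterns",
}

# Postings with term frequencies: term -> {docID: tf}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf


def rank(query_terms, index):
    """Order documents by the summed term frequency of the query terms."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties are broken by the smaller docID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank(["system", "design"], index)
print(ranking)
```

Unlike a Boolean result set, the output is an ordered list of (docID, score) pairs, so documents matching only some of the query terms still appear, lower down.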


•  Exercise 1.12 [⋆] Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain.
