Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Modern Information Retrieval
Chapter 7
Queries Languages amp Properties
Query LanguagesQuery Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 1
Queries Languages amp PropertiesThis chapter covers the main aspects of queriesincluding
the different languages used to express them
their distribution and approaches for analyzing them with focus onthe Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 2
Query Languages
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 3
Query LanguagesWe cover now the different kind of queries normallyposed to text retrieval systems
This is in part dependent on the retrieval model thesystem adopts
That is a full-text system will not answer the same kind of queriesas those answered by a system based on keyword ranking
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 4
Query LanguagesThere is a difference between information retrievaland data retrieval
Languages for information retrieval allow the answer tobe ranked
For query languages not aimed at information retrievalthe concept of ranking cannot be easily defined
We consider these languages as languages for data retrieval
Some query languages are not intended for final users
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 5
Query LanguagesThere are a number of techniques to enhance theusefulness of the queries
Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus
Some words which are very frequent and do not carrymeaning (called stopwords) may be removed
We refer to words that can be used to match queryterms as keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6
Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts
The retrieval unit is the basic element which can beretrieved as an answer to a query
We call the retrieval units simply documents even ifthis reference can be used with different meanings
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Queries Languages amp PropertiesThis chapter covers the main aspects of queriesincluding
the different languages used to express them
their distribution and approaches for analyzing them with focus onthe Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 2
Query Languages
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 3
Query LanguagesWe cover now the different kind of queries normallyposed to text retrieval systems
This is in part dependent on the retrieval model thesystem adopts
That is a full-text system will not answer the same kind of queriesas those answered by a system based on keyword ranking
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 4
Query LanguagesThere is a difference between information retrievaland data retrieval
Languages for information retrieval allow the answer tobe ranked
For query languages not aimed at information retrievalthe concept of ranking cannot be easily defined
We consider these languages as languages for data retrieval
Some query languages are not intended for final users
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 5
Query LanguagesThere are a number of techniques to enhance theusefulness of the queries
Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus
Some words which are very frequent and do not carrymeaning (called stopwords) may be removed
We refer to words that can be used to match queryterms as keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6
Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts
The retrieval unit is the basic element which can beretrieved as an answer to a query
We call the retrieval units simply documents even ifthis reference can be used with different meanings
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query Languages
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 3
Query LanguagesWe cover now the different kind of queries normallyposed to text retrieval systems
This is in part dependent on the retrieval model thesystem adopts
That is a full-text system will not answer the same kind of queriesas those answered by a system based on keyword ranking
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 4
Query LanguagesThere is a difference between information retrievaland data retrieval
Languages for information retrieval allow the answer tobe ranked
For query languages not aimed at information retrievalthe concept of ranking cannot be easily defined
We consider these languages as languages for data retrieval
Some query languages are not intended for final users
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 5
Query LanguagesThere are a number of techniques to enhance theusefulness of the queries
Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus
Some words which are very frequent and do not carrymeaning (called stopwords) may be removed
We refer to words that can be used to match queryterms as keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6
Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts
The retrieval unit is the basic element which can beretrieved as an answer to a query
We call the retrieval units simply documents even ifthis reference can be used with different meanings
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesWe cover now the different kind of queries normallyposed to text retrieval systems
This is in part dependent on the retrieval model thesystem adopts
That is a full-text system will not answer the same kind of queriesas those answered by a system based on keyword ranking
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 4
Query LanguagesThere is a difference between information retrievaland data retrieval
Languages for information retrieval allow the answer tobe ranked
For query languages not aimed at information retrievalthe concept of ranking cannot be easily defined
We consider these languages as languages for data retrieval
Some query languages are not intended for final users
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 5
Query LanguagesThere are a number of techniques to enhance theusefulness of the queries
Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus
Some words which are very frequent and do not carrymeaning (called stopwords) may be removed
We refer to words that can be used to match queryterms as keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6
Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts
The retrieval unit is the basic element which can beretrieved as an answer to a query
We call the retrieval units simply documents even ifthis reference can be used with different meanings
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesThere is a difference between information retrievaland data retrieval
Languages for information retrieval allow the answer tobe ranked
For query languages not aimed at information retrievalthe concept of ranking cannot be easily defined
We consider these languages as languages for data retrieval
Some query languages are not intended for final users
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 5
Query LanguagesThere are a number of techniques to enhance theusefulness of the queries
Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus
Some words which are very frequent and do not carrymeaning (called stopwords) may be removed
We refer to words that can be used to match queryterms as keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6
Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts
The retrieval unit is the basic element which can beretrieved as an answer to a query
We call the retrieval units simply documents even ifthis reference can be used with different meanings
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesThere are a number of techniques to enhance theusefulness of the queries
Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus
Some words which are very frequent and do not carrymeaning (called stopwords) may be removed
We refer to words that can be used to match queryterms as keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6
Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts
The retrieval unit is the basic element which can beretrieved as an answer to a query
We call the retrieval units simply documents even ifthis reference can be used with different meanings
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts
The retrieval unit is the basic element which can beretrieved as an answer to a query
We call the retrieval units simply documents even ifthis reference can be used with different meanings
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesKeyword Based Querying
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Keyword Based QueryingA query is the formulation of a user information need
Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking
However a query can also be a more complexcombination of operations involving several words
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word
Some models are also able to see the internal divisionof words into letters
In this case the alphabet is split into letters and separators
A word is a sequence of letters surrounded by separators
The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query
Further the resulting documents are ranked accordingto the degree of similarity with respect to the query
To support ranking two common statistics on wordoccurrences inside texts are commonly used
The first is called term frequency and counts the number oftimes a word appears inside a document
The second is called inverse document frequency and countsthe number of documents in which a word appears
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form
In this case a document matches a query only if it contains allthe words in the query
This is useful when the number of results for one singleword is too large
Additionally it may be required that the exact positionsin which a word occurs in the text should be provided
This might be useful for highlighting word occurrencesin snippets for instance during the display of results
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Context QueriesMany systems complement queries with the ability tosearch words in a given context
Words which appear near each other may signal higherlikelihood of relevance than if they appear apart
We may want to form phrases of words or find wordswhich are proximal in the text
Phrase
Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words
Proximity
Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators
A boolean query has a syntax composed of
atoms basic queries that retrieve documents
boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents
This scheme is in general compositional operatorscan be composed over the results of other operators
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Boolean QueriesA query syntax tree is naturally defined
Consider the example of a query syntax tree below
syntax syntactic
ORtranslation
AND
It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are
e1 OR e2 the query selects all documents which satisfy e1 or e2
e1 AND e2 selects all documents which satisfy both e1 and e2
e1 BUT e2 selects all documents which satisfy e1 but not e2
NOT e2 the query selects all documents which not contain e2
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided
A document either satisfies the boolean query or it does not
This is quite a limitation because it does not allow forpartial matching between a document and a user query
To overcome this limitation the condition for retrievalmust be relaxed
For instance a document which partially satisfies an ANDcondition might be retrieved
The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Boolean QueriesA fuzzy-boolean set of operators has been proposed
The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents
The documents are ranked higher when they have alarger number of elements in common with the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesBeyond Keywords
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment
Those segments satisfying the pattern specificationsare said to match the pattern
We can search for documents containing segmentswhich match a given search pattern
Each system allows specifying some types of patterns
The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Pattern MatchingThe most used types of patterns are
Words a string which must be a word in the text
Prefixes a string which must form the beginning of a text word
Suffixes a string which must form the termination of a text word
Substrings a string which can appear within a text word
Ranges a pair of strings which matches any word whichlexicographically lies between them
Allowing errors a word together with an error threshold
Regular expressions a rather general pattern built up by simplestrings
Extended patterns a more user-friendly query language torepresent some common cases of regular expressions
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred
In this case a query becomes simply an enumeration ofwords and context queries
The negation can be handled by letting the userexpress that some words are not desired
Then the documents containing them are penalized in the rankingcomputation
Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesStructural Queries
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Structural QueriesThe text collections tend to have some structure builtinto them
The standardization of languages to represent structured textshas pushed forward in this direction
Mixing contents and structure in queries allows posingvery powerful queries
Queries can be expressed using containment proximityor other restrictions on the structural elements
More details in Chapter 13 on Structured Text Retrieval
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Structural QueriesThe three main types of structures
a) form-like fixed structureb) hypertext structurec) hierarchical structure
a) b) c)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Fixed StructureThe structure allowed in texts was traditionally quiterestrictive
The documents had a fixed set of fields and each fieldhad some text inside
Some fields were not present in all documents
Some documents could have text not classified under any field
They were not allowed to nest or overlap
Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc
This idea leads naturally to the relational model eachfield corresponding to a column in the database table
There are several proposals that extend SQL to allowfull-text retrieval
Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power
Retrieval from hypertext began as a merely navigationalactivity
That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted
Some query tools allow querying hypertext based ontheir content and their structure
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure
An example of a hierarchical structure the page of abook and its schematic view
Chapter 6
We cover in this chapter the different kind of
61 Keyword Based
63 Structured Queries
chapter
section section
title title figure
We cover Keyword Based Structural
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Fixed StructureAn example of a query to the hierarchical structurepresented
IN
figure WITH
WITHsection
title ldquostructuralrdquo
This parsed query returns the image below
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query LanguagesQuery Protocols
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query ProtocolsSometimes query languages are used by applicationsto query text databases
Because they are not intended for human use we referto them as protocols rather than languages
The most important arewere
Z3950
Wide Area Information Service (WAIS)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols
The main goal of these protocols is to provide discinterchangeability
We can cite three of them
Common Command Language (CCL)
Compact Disk Read only Data exchange (CD-RDx)
Structured Full-text Query Language (SFQL)
SFQL is based on SQL and also has a client-serverarchitecture
The language does not define any specific formatting ormarkup
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query ProtocolsFor example a query in SFQL is
Select abstract from journalpaperswhere title contains text search
The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition
For example where paper contains retrievalor like info and date gt 1198
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query Properties
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs
People have different search needs at different timesand in different contexts
A number of researchers have attempted to classify andtally the types of information needs
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesMost web search engines record information aboutthe queries
This information includes the query itself the time and the IPaddress
Some systems also record which search results were clickedon for a given query
These logs are a valuable resource for understandingthe kinds of information needs that users have
The first studies concentrated on basic statistics
query occurrences
term occurrences
query length
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries
Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002
In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine
This study found that the mean length of the queries was 28terms with the longest query having 25 terms
Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references
Length Dogpile (2005) Yahoo (2007)
1 18 25
2 32 35
3 25 23
4 13 11
5 6 4
6 3 1
gt7 3 1
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesQueries as words in a text follow a biased distribution
In fact the frequency of query words follow a Zipfrsquos lawwith parameter α
The value of α ranges from 06 to 14 perhaps due to languageand cultural differences
However this is less biased than Web text where α iscloser to 2
The standard correlation among the frequency of aword in the Web pages and in the queries also varies
These values range from 015 to 042
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web
Relative query frequency vs relative documentfrequency of each word in a vocabulary
1e-07
1e-06
1e-05
1e-04
0001
001
01
1
1e-06 1e-05 1e-04 0001 001 01 1
Que
ry fr
eque
ncy
Document frequency
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search
Many people refines the query adding and removingwords but most of them see very few answer pages
The table below shows query statistics for four differentsearch engines
Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)
Words per query 24 26 23 11
Queries per user 20 23 29 ndash
Answer pages per query 13 17 22 12
Boolean queries lt40 10 ndash 16
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)
Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query PropertiesUser Search Behavior
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
User Search BehaviorState diagram of user behavior in a small search engine
024
075
013
006
086
009
006
062
00601
017
075
015
091
006
009
041
017
008
099
022
027
URL
035
gt02
005gt=xgt001
02gt=xgt005
009
00
Web
003
001
058
006
009
011
Home Page
Advanced search
Directory
Selected
Results
Refining
End session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest
The design of the search tool was always targeted tohelp the user write good queries
This implies that the query language adopted was usuallycomplex
The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query IntentBroderrsquos taxonomy of Web search goals
Navigational The immediate intent is to reach a particular site(245 survey 20 query log)
Informational The intent is to acquire some information (39survey 48 query log)
Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)
This taxonomy has been heavily influential indiscussions of query types on the Web
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos
They noted that much of what happens on the Web isthe acquisition and consumption of online resources
Thus they replace Broderrsquos transactions category with abroader category of resources
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query IntentRecent research have dealt with the automaticprediction of these classes
These works use machine learning over different queryattributes such as
Anchor-text distribution of the words in the queries
Past click behavior
The main problem is that many queries are inherentlyambiguous
They can be classified in more than one class when the contextof the search is not known
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent
For example a search involving the topic of weathercan consist of
the simple information need of looking at todayrsquos forecast or
the rich and complex information need of studying meteorology
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs
The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005
On the other hand commerce-related queries increased from13 to almost 25 in the same years
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query TopicChanges in Excite topics from 1997 to 2001 ()
Rank Topic 1997 2001
1 Commerce travel employment or economy 133 247
2 People places or things 67 197
3 Non-English or unknown 41 113
4 Computers or Internet 125 96
5 Sex or pornography 168 85
6 Health or sciences 95 75
7 Entertainment or recreation 199 66
8 Education or humanities 56 45
9 Society culture ethnicity or religion 54 39
10 Government 34 20
11 Performing or fine arts 54 11
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project
Their results F-score of about 45 on 63 categories
The table below shows the results for five queries
Query Top category Second category
chat rooms ComputersInternet Online CommunityChat
lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation
stephen hawking InfoScience amp Tech InfoArts amp Humanities
dog shampoo ShoppingBuying Guides LivingPets amp Animals
text mining ComputersSoftware InformationCompanies
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories
This is an important problem because rare or infrequent queriesare approximately half of all queries
Training data a commercial taxonomy containing manydocuments assigned to each category
They classified the results of a query in the classifierand then used a voting algorithm to classify the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings
For instance a query can be related to politics but also to news
There are few papers for ambiguity detection
Song et al estimated that about 16 of all queries areambiguous
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions
Early work used fixed limits of time to define sessionsbut this definition presents two problems
the session might be longer
the goals of the user in the session might be more than one
Hence it is good to distinguish
time based sessions queries of a user in a same session
missions sequence of queries with the same goal
Research missions an additional problem is thatmissions can span more than one session
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time
Different authors have found different thresholds ranging from fiveto sixty minutes
Query missions are a sequence of reformulatedqueries that express a same need
The detection of missions is a hard problem and hasbeen approached by several researchers
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query PropertiesQuery Difficulty
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty
For example one word queries are simpler than phrase queries
There are two different ways to measure difficulty
Post-retrieval predictions mechanisms run the query andanalyze its answer set
Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand
Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents
The intuition is that the top ranked results of anunambiguous query will be topically cohesive
Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms
Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used
Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists
If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback
Weighted Information Gain measures the change ininformation about the quality of retrieval between
an imaginary state that only an average document is retrieved(estimated by the collection model) and
a posterior state where the actual search results are observed
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem
The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel
From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime
The overlap between L and Lprime is used as prediction score
Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty
For example they either take into account
the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or
the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)
Kwok et al proposed measures based on the inversedocument frequency and the collection frequency
Queries with low frequency terms are predicted toachieve a better performance
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score
Query Scope uses the number of documents in thecollection that contain at least one of the query terms
Simplified Clarity Score (SCS) relies on termfrequencies
SCS(q) =sum
kiisinq
Pml(ki|q) times log2
(
Pml(ki|q)
Pcoll(ki)
)
where
Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q
Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection
AvPMI(q) =1
|(ki kℓ)|
sum
(kikℓ)isinq
log2
(
Pcoll(ki kℓ)
Pcoll(ki)Pcoll(kℓ)
)
where Pcoll(ki kℓ) is the probability that terms ki and kℓ
occur in the same document
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67
Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors
This is because the information available to them ismuch more general and much more sparse
Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors
Both approaches are computationally intensive
The technique proposed by He et al requires clustering
The technique proposed by Zhao et al requires determining thetfidf distribution of all documents
However both techniques are not efficient enough to beuseful for large collections
Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67