Modern Information Retrieval Chapter 7 Queries: Languages & Properties Query Languages Query Properties Queries: Languages & Properties, Modern Information Retrieval, Addison Wesley, 2010 – p. 1

Chapter 7 Queries: Languages & Properties


This chapter covers the main aspects of queries, including:

the different languages used to express them

their distribution and approaches for analyzing them, with a focus on the Web

Query Languages

We now cover the different kinds of queries normally posed to text retrieval systems.

This is in part dependent on the retrieval model the system adopts.

That is, a full-text system will not answer the same kinds of queries as those answered by a system based on keyword ranking.

There is a difference between information retrieval and data retrieval.

Languages for information retrieval allow the answer to be ranked.

For query languages not aimed at information retrieval, the concept of ranking cannot be easily defined.

We consider these languages as languages for data retrieval.

Some query languages are not intended for final users.

There are a number of techniques to enhance the usefulness of queries.

Some examples are the expansion of a word to the set of its synonyms or the use of a thesaurus.

Some words which are very frequent and do not carry meaning (called stopwords) may be removed.

We refer to words that can be used to match query terms as keywords.

Another issue is the retrieval unit the information retrieval system adopts.

The retrieval unit is the basic element which can be retrieved as an answer to a query.

We call the retrieval units simply documents, even if this term can be used with different meanings.

Query Languages: Keyword Based Querying

Keyword Based Querying

A query is the formulation of a user information need.

Keyword based queries are popular, since they are intuitive, easy to express, and allow for fast ranking.

However, a query can also be a more complex combination of operations involving several words.

Word Queries

The most elementary query that can be formulated in a text retrieval system is the word.

Some models are also able to see the internal division of words into letters.

In this case, the alphabet is split into letters and separators.

A word is a sequence of letters surrounded by separators.

The division of the text into words is not arbitrary, since words carry a lot of meaning in natural language.

The result of a word query is the set of documents containing at least one of the words of the query.

Further, the resulting documents are ranked according to their degree of similarity with respect to the query.

To support ranking, two statistics on word occurrences inside texts are commonly used.

The first is called term frequency and counts the number of times a word appears inside a document.

The second is called inverse document frequency and is derived from the number of documents in which a word appears: the fewer documents contain the word, the higher its value.
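
The two statistics above can be sketched as follows. This is a minimal illustration, not the formulation of any particular system: the logarithmic form log(N / n) for inverse document frequency is one common variant, and the function names and toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def term_frequency(word, document_tokens):
    """Number of times `word` occurs in one tokenized document."""
    return Counter(document_tokens)[word]

def inverse_document_frequency(word, collection):
    """log(N / n), where n is the number of documents containing `word`."""
    n_docs = len(collection)
    n_containing = sum(1 for tokens in collection if word in tokens)
    if n_containing == 0:
        return 0.0  # convention chosen here for words absent from the collection
    return math.log(n_docs / n_containing)

# Toy collection (hypothetical): each document is a list of tokens.
docs = [
    "query languages for text retrieval".split(),
    "text retrieval and data retrieval".split(),
    "web query logs".split(),
]

print(term_frequency("retrieval", docs[1]))
print(inverse_document_frequency("text", docs))
print(inverse_document_frequency("web", docs))
```

A word that appears in fewer documents ("web") receives a higher inverse document frequency than one that appears in more documents ("text").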

The other possibility for interpreting queries, popularized by Web search engines, is the conjunctive form.

In this case, a document matches a query only if it contains all the words in the query.

This is useful when the number of results for a single word is too large.

Additionally, it may be required that the exact positions in which a word occurs in the text be provided.

This might be useful, for instance, for highlighting word occurrences in snippets during the display of results.
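
The difference between the "at least one word" and the conjunctive interpretation, and the recording of word positions, can be sketched with a small positional index. The index layout, the function names, and the toy documents are assumptions made for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def match_any(index, query):
    """Documents containing at least one word of the query."""
    result = set()
    for word in query.lower().split():
        result |= set(index.get(word, {}))
    return result

def match_all(index, query):
    """Conjunctive form: documents containing every word of the query."""
    postings = [set(index.get(word, {})) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Toy collection (hypothetical).
docs = {
    1: "syntax of query languages",
    2: "query logs of web search",
    3: "web search engines",
}
index = build_positional_index(docs)

print(match_any(index, "query web"))
print(match_all(index, "query web"))
print(index["query"])  # word positions, usable for snippet highlighting
```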

Context Queries

Many systems complement queries with the ability to search for words in a given context.

Words which appear near each other may signal a higher likelihood of relevance than words which appear apart.

We may want to form phrases of words or find words which are proximal in the text.

Phrase

Is a sequence of single-word queries.

An occurrence of the phrase is a sequence of words.

Can be ranked in a fashion somewhat analogous to single words.

Proximity

Is a more relaxed version of the phrase query.

A maximum allowed distance between single words or phrases is given.

The ranking technique can depend on physical proximity.
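
Phrase and proximity matching can be illustrated on a tokenized text. The sketch below assumes that distance is measured as the difference between word positions (adjacent words are at distance 1); the slides do not fix a distance unit, and the function names are hypothetical.

```python
def positions(tokens, word):
    """All positions at which `word` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]

def phrase_occurrences(tokens, phrase):
    """Start positions where the words of `phrase` occur consecutively."""
    n = len(phrase)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within_distance(tokens, w1, w2, max_distance):
    """True if some occurrences of w1 and w2 are at most max_distance words apart."""
    return any(
        abs(p1 - p2) <= max_distance
        for p1 in positions(tokens, w1)
        for p2 in positions(tokens, w2)
    )

text = ("enhance the retrieval of text and the ranking "
        "of information retrieval results").split()

print(phrase_occurrences(text, ["information", "retrieval"]))
print(within_distance(text, "text", "retrieval", 2))
print(within_distance(text, "enhance", "ranking", 3))
```

The proximity query accepts "text" and "retrieval" two positions apart, which the stricter phrase query would reject.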

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators.

A boolean query has a syntax composed of:

atoms: basic queries that retrieve documents

boolean operators: work on their operands (which are sets of documents) and deliver sets of documents

This scheme is in general compositional: operators can be composed over the results of other operators.

A query syntax tree is naturally defined.

Consider the following example of a query syntax tree:

[Figure: query syntax tree with AND at the root; its children are the word translation and an OR node over the words syntax and syntactic]

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic.

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

e1 OR e2: selects all documents which satisfy e1 or e2

e1 AND e2: selects all documents which satisfy both e1 and e2

e1 BUT e2: selects all documents which satisfy e1 but not e2

NOT e2: selects all documents which do not contain e2
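
Because the operators work on sets of documents, a query syntax tree can be evaluated recursively with set operations. The sketch below is an illustration only: the tuple encoding of the tree, the function name, and the toy index (word to set of document ids) are assumptions.

```python
def evaluate(node, index, all_docs):
    """Evaluate a query tree: a word (str) or a tuple (operator, operand, ...)."""
    if isinstance(node, str):
        return set(index.get(node, set()))  # atom: documents containing the word
    op, *args = node
    sets = [evaluate(arg, index, all_docs) for arg in args]
    if op == "OR":
        return sets[0] | sets[1]
    if op == "AND":
        return sets[0] & sets[1]
    if op == "BUT":
        return sets[0] - sets[1]
    if op == "NOT":
        return all_docs - sets[0]  # complement with respect to the collection
    raise ValueError(f"unknown operator: {op}")

# Toy index (hypothetical): word -> ids of the documents containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}
all_docs = {1, 2, 3, 4, 5}

# translation AND (syntax OR syntactic)
tree = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(tree, index, all_docs))
```

Note that NOT needs the whole collection as an argument, which reflects why it is rarely used alone.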

With classic boolean systems, no ranking of the retrieved documents is provided.

A document either satisfies the boolean query or it does not.

This is quite a limitation, because it does not allow for partial matching between a document and a user query.

To overcome this limitation, the condition for retrieval must be relaxed.

For instance, a document which partially satisfies an AND condition might be retrieved.

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection.

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed, so that they retrieve more documents.

Documents are ranked higher when they have a larger number of elements in common with the query.
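
The ranking idea in the last sentence can be sketched by counting shared terms. This is a simplified stand-in, not the proposed fuzzy-boolean operators themselves; the function name and toy documents are assumptions.

```python
def rank_by_overlap(query_terms, docs):
    """Rank document ids by how many distinct query terms each document contains."""
    query = set(query_terms)
    scores = {doc_id: len(query & set(tokens)) for doc_id, tokens in docs.items()}
    # Higher overlap first; ties broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]

docs = {
    "d1": ["syntax", "translation", "parsing"],
    "d2": ["translation", "memory"],
    "d3": ["web", "search"],
}

print(rank_by_overlap(["syntax", "translation"], docs))
```

A strict AND would return only d1; the relaxed form also retrieves d2, ranked lower because it matches a single term.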

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments which match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate.

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
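
Several of the pattern types listed above can be illustrated against a small vocabulary. The sketch assumes that "allowing errors" is measured with Levenshtein edit distance, which is one common choice and not stated in the slides; the word list and patterns are made up for the example.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

words = ["retrieval", "retrieve", "retrieving", "query", "queries", "text", "context"]

prefix = [w for w in words if w.startswith("retriev")]
suffix = [w for w in words if w.endswith("text")]
substring = [w for w in words if "ex" in w]
in_range = [w for w in words if "q" <= w <= "r"]          # lexicographic range
with_errors = [w for w in words if edit_distance(w, "retreival") <= 2]
regex = [w for w in words if re.fullmatch(r"quer(y|ies)", w)]

print(prefix, suffix, substring, in_range, with_errors, regex, sep="\n")
```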

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

Then, the documents containing them are penalized in the ranking computation.

Under this scheme, we have completely eliminated any reference to boolean operations and entered the field of natural language queries.

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details in Chapter 13 on Structured Text Retrieval.

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

[Figure: schematic views of the three structure types a), b), and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, with each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links, to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: on one side, a book page headed "Chapter 6" with the text "We cover in this chapter the different kind of ..." and the section headings "6.1 Keyword Based" and "6.3 Structured Queries"; on the other side, its schematic view as a tree with a chapter node over section nodes, which contain title and figure nodes]

An example of a query to the hierarchical structure presented:

[Figure: query syntax tree with IN at the root over figure and a WITH node; that WITH node is over section and a second WITH node, which is over title and the word "structural"]

This parsed query returns the image below.

[Figure: image returned by the query]

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

... where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties

Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length
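
The three basic statistics can be computed directly from a list of logged query strings. The sketch below assumes whitespace tokenization and a tiny made-up log; real logs also carry timestamps, IP addresses, and clicks, which are ignored here.

```python
from collections import Counter

def basic_statistics(queries):
    """Return query counts, term counts, length distribution, and mean length."""
    query_counts = Counter(queries)
    term_counts = Counter(term for q in queries for term in q.split())
    lengths = [len(q.split()) for q in queries]
    length_counts = Counter(lengths)
    mean_length = sum(lengths) / len(lengths)
    return query_counts, term_counts, length_counts, mean_length

# Hypothetical log entries.
log = ["weather", "weather forecast", "cheap flights", "weather", "text mining tools"]

query_counts, term_counts, length_counts, mean_length = basic_statistics(log)
print(query_counts.most_common(1))
print(term_counts.most_common(1))
print(dict(length_counts))
print(mean_length)
```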

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

The table below shows the distribution of query lengths, as a percentage of queries, for these last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
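
Under a Zipf's law, the frequency of the word of rank r is proportional to r^(-α), so α appears as the negated slope of the frequency curve on a log-log scale. The sketch below estimates α by ordinary least squares on log(rank) versus log(frequency). The fitting method is an assumption chosen for illustration (the slides do not say how α was estimated), and the data are synthetic.

```python
import math

def estimate_zipf_alpha(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic frequencies that follow an exact power law with alpha = 1.2.
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 201)]
alpha = estimate_zipf_alpha(synthetic)
print(round(alpha, 3))
```

On real query logs the fit is only approximate, and the estimate is sensitive to how the low-frequency tail is treated.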

This implies that what people search for is different from what people publish on the Web.

[Figure: log-log scatter plot of relative query frequency (vertical axis, 1e-07 to 1) versus relative document frequency (horizontal axis, 1e-06 to 1) of each word in a vocabulary]

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries (%)      <40                10              –                  16

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of queries has shifted from leisure to e-commerce (more details later, in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Labels recoverable from the figure: Home Page, Advanced search, Directory, Selected, Results, Refining, End session, URL, and Web. Edges carry transition probabilities, with a legend distinguishing probabilities above 0.2, between 0.05 and 0.2, and between 0.01 and 0.05]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Rose & Levinson developed a taxonomy that extended and somewhat changed Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997   2001
1      Commerce, travel, employment, or economy    13.3   24.7
2      People, places, or things                   6.7    19.7
3      Non-English or unknown                      4.1    11.3
4      Computers or Internet                       12.5   9.6
5      Sex or pornography                          16.8   8.5
6      Health or sciences                          9.5    7.5
7      Entertainment or recreation                 19.9   6.6
8      Education or humanities                     5.6    4.5
9      Society, culture, ethnicity, or religion    5.4    3.9
10     Government                                  3.4    2.0
11     Performing or fine arts                     5.4    1.1

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
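
The last step, classifying a query from the categories of its results, can be sketched with a simple plurality vote. The slides do not specify the voting algorithm, so plurality voting is an assumption here, and the per-result categories are taken as given (in practice they would come from the trained classifier).

```python
from collections import Counter

def classify_query_by_voting(result_categories, top_k=1):
    """Return the top_k categories that received the most votes from the results."""
    votes = Counter(result_categories)
    return [category for category, _ in votes.most_common(top_k)]

# Hypothetical categories assigned to the top results of the query "dog shampoo".
result_categories = [
    "Shopping/Buying Guides",
    "Living/Pets & Animals",
    "Shopping/Buying Guides",
    "Shopping/Buying Guides",
    "Living/Pets & Animals",
]

print(classify_query_by_voting(result_categories, top_k=2))
```

Classifying the results rather than the query itself gives the classifier far more text to work with, which matters for short and rare queries.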

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer

the goals of the user in the session might be more than one

Hence, it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
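
Splitting a user's queries into sessions with a maximum inactivity time can be sketched as follows. The 30-minute default is an assumption picked from within the five to sixty minute range mentioned above; the event format (timestamp in seconds, query string) and the sample data are also made up for the example.

```python
def split_sessions(events, max_inactivity_seconds=30 * 60):
    """Split one user's (timestamp_seconds, query) events, sorted by time, into sessions."""
    sessions = []
    previous_time = None
    for timestamp, query in events:
        # Start a new session on the first event or after a long enough gap.
        if previous_time is None or timestamp - previous_time > max_inactivity_seconds:
            sessions.append([])
        sessions[-1].append(query)
        previous_time = timestamp
    return sessions

events = [
    (0, "lake michigan lodges"),
    (120, "lake michigan cabins"),
    (4000, "dog shampoo"),
]

print(split_sessions(events))
```

This only finds time-based sessions. Detecting missions would additionally require judging whether consecutive queries express the same need.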

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
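
The idea behind Clarity Score can be sketched as the Kullback-Leibler divergence between a language model of the top retrieved documents and the collection language model. This is a simplified illustration under stated assumptions: both models are plain unigram maximum likelihood estimates, the top documents are pooled with equal weight, and the top documents are part of the collection (so every term has a non-zero collection probability). The published method may differ in how the query model is estimated.

```python
import math
from collections import Counter

def unigram_model(documents):
    """Maximum likelihood unigram model over a list of tokenized documents."""
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def clarity_score(top_documents, collection):
    """KL divergence (base 2) of the top-document model from the collection model."""
    top_model = unigram_model(top_documents)
    coll_model = unigram_model(collection)
    return sum(p * math.log2(p / coll_model[term]) for term, p in top_model.items())

# Toy collection (hypothetical): two documents about cars, two about animals.
collection = [
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "speed"],
    ["cat", "jungle", "prey"],
]

focused = clarity_score([collection[0], collection[2]], collection)
unfocused = clarity_score(collection, collection)
print(focused, unfocused)
```

A topically cohesive result set diverges from the collection and scores higher; a result set that looks like the whole collection scores zero.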

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy.

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
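
The final step of Query Feedback, measuring the overlap between L and L′, can be sketched as the fraction of shared documents in the top k of both lists. The cutoff k and the normalization by k are assumptions, and the generation of Q′ from L is not shown because the slides do not describe it.

```python
def overlap_at_k(ranking_a, ranking_b, k):
    """Fraction of documents shared by the top-k of two ranked lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / k

# Hypothetical rankings: L for the original query, L_prime for the generated query.
L = ["d3", "d1", "d7", "d2"]
L_prime = ["d1", "d3", "d9", "d4"]

print(overlap_at_k(L, L_prime, 3))
```

A high overlap suggests the ranked list carried enough signal to reconstruct the query, so the query is predicted to perform well.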

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low frequency terms are predicted to achieve a better performance.

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

\[
\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)
\]

where

$P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
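
The SCS formula can be computed directly from term counts. The sketch below assumes that the maximum likelihood estimate of a term given the query is its count in the query divided by the query length, and that every query term occurs in the collection (no smoothing). The counts are made up for the example.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, collection_counts, collection_size):
    """SCS(q): sum over distinct query terms of P_ml * log2(P_ml / P_coll)."""
    query_counts = Counter(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)                     # P_ml(k_i | q)
        p_coll = collection_counts[term] / collection_size  # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Hypothetical collection statistics: term -> number of occurrences.
collection_counts = {"text": 50, "mining": 5, "the": 500}
collection_size = 1000  # total number of term occurrences in the collection

specific = simplified_clarity_score(["text", "mining"], collection_counts, collection_size)
general = simplified_clarity_score(["the", "text"], collection_counts, collection_size)
print(specific, general)
```

Queries made of terms that are rare in the collection score higher, in line with the prediction that low frequency terms lead to better performance.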

Averaged PMI measures the average mutual information of two query terms in the collection:

\[
\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)
\]

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
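
A minimal sketch of the Averaged PMI formula follows. It makes two assumptions that the slides leave open: all probabilities, including those of single terms, are estimated at the document level (fraction of documents containing the terms), and every pair of query terms co-occurs in at least one document, since no smoothing is applied.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, documents):
    """Average log2 ratio of joint to independent document-level probabilities."""
    n_docs = len(documents)

    def probability(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(1 for doc in documents if all(t in doc for t in terms)) / n_docs

    pairs = list(combinations(sorted(set(query_terms)), 2))
    values = [
        math.log2(probability(ki, kl) / (probability(ki) * probability(kl)))
        for ki, kl in pairs
    ]
    return sum(values) / len(pairs)

# Toy collection (hypothetical): each document is a set of terms.
documents = [
    {"text", "mining"},
    {"text", "mining"},
    {"text", "web"},
    {"web", "search"},
]

score = averaged_pmi(["text", "mining"], documents)
print(score)
```

A positive value means the query terms co-occur more often than independence would predict.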

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.

  • Queries Languages amp Properties
  • Query Languages
  • Query Languages
  • Query Languages
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Word Queries
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Pattern Matching
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Structural Queries
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Query Protocols
  • Query Protocols
  • Query Protocols
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms

Queries Languages amp PropertiesThis chapter covers the main aspects of queriesincluding

the different languages used to express them

their distribution and approaches for analyzing them with focus onthe Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 2

Query Languages

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 3

Query Languages

We now cover the different kinds of queries normally posed to text retrieval systems.

The kinds of queries that can be posed depend in part on the retrieval model the system adopts.

That is, a full-text system will not answer the same kinds of queries as those answered by a system based on keyword ranking.

Query Languages

There is a difference between information retrieval and data retrieval.

Languages for information retrieval allow the answer to be ranked.

For query languages not aimed at information retrieval, the concept of ranking cannot be easily defined.

We consider these languages as languages for data retrieval.

Some query languages are not intended for final users.

Query Languages

There are a number of techniques to enhance the usefulness of queries.

Some examples are the expansion of a word to the set of its synonyms, or the use of a thesaurus.

Some words which are very frequent and do not carry meaning (called stopwords) may be removed.

We refer to words that can be used to match query terms as keywords.

Query Languages

Another issue is the retrieval unit the information retrieval system adopts.

The retrieval unit is the basic element which can be retrieved as an answer to a query.

We call the retrieval units simply documents, even if this reference can be used with different meanings.

Query Languages: Keyword Based Querying

Keyword Based Querying

A query is the formulation of a user information need.

Keyword based queries are popular, since they are intuitive, easy to express, and allow for fast ranking.

However, a query can also be a more complex combination of operations involving several words.

Word Queries

The most elementary query that can be formulated in a text retrieval system is the word.

Some models are also able to see the internal division of words into letters.

In this case, the alphabet is split into letters and separators.

A word is a sequence of letters surrounded by separators.

The division of the text into words is not arbitrary, since words carry a lot of meaning in natural language.

Word Queries

The result of word queries is the set of documents containing at least one of the words of the query.

Further, the resulting documents are ranked according to their degree of similarity with respect to the query.

To support ranking, two statistics on word occurrences inside texts are commonly used.

The first is called term frequency and counts the number of times a word appears inside a document.

The second is called inverse document frequency and is based on the number of documents in which a word appears: the fewer documents contain the word, the higher its value.
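The two statistics can be sketched in a few lines of Python. This is a minimal illustration, not the formulation of any particular system: it assumes whitespace tokenization, and it assumes the common idf variant log2(N / n_i), where N is the number of documents and n_i the number of documents containing the word. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(document):
    """Count how many times each word appears in one document."""
    return Counter(document.lower().split())

def inverse_document_frequencies(documents):
    """Map each word to log2(N / n_i); this idf variant is an assumption."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, however often it occurs there.
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log2(n_docs / n) for word, n in doc_freq.items()}

# Toy collection, for illustration only.
docs = [
    "query languages for text retrieval",
    "boolean query operators",
    "text text mining",
]
tf = term_frequencies(docs[2])
idf = inverse_document_frequencies(docs)
```

Here `tf["text"]` is 2, and `idf["mining"]` is larger than `idf["text"]` because "mining" appears in fewer documents.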

Word Queries

The other possibility for interpreting queries, popularized by Web search engines, is the conjunctive form.

In this case, a document matches a query only if it contains all the words in the query.

This is useful when the number of results for a single word is too large.

Additionally, it may be required that the exact positions at which a word occurs in the text be provided.

This might be useful, for instance, for highlighting word occurrences in snippets during the display of results.
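The difference between the two interpretations, and the role of word positions, can be sketched with a small positional index. This is an illustrative sketch under simple assumptions (whitespace tokenization, documents identified by their list position); the function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

def match_any(index, query):
    """Disjunctive form: documents containing at least one query word."""
    result = set()
    for word in query.lower().split():
        result |= set(index.get(word, {}))
    return result

def match_all(index, query):
    """Conjunctive form: documents containing all the query words."""
    doc_sets = [set(index.get(word, {})) for word in query.lower().split()]
    return set.intersection(*doc_sets) if doc_sets else set()

# Toy collection, for illustration only.
docs = ["web search engines", "search the text", "text retrieval engines"]
index = build_positional_index(docs)
```

With this collection, `match_any(index, "search engines")` returns all three documents, while `match_all(index, "search engines")` returns only document 0. The stored positions, such as `index["engines"]`, are what a snippet generator would use for highlighting.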

Context Queries

Many systems complement queries with the ability to search for words in a given context.

Words which appear near each other may signal a higher likelihood of relevance than if they appear apart.

We may want to form phrases of words or find words which are proximal in the text.

Phrase

  • Is a sequence of single-word queries
  • An occurrence of the phrase is a sequence of words
  • Can be ranked in a fashion somewhat analogous to single words

Proximity

  • Is a more relaxed version of the phrase query
  • A maximum allowed distance between single words or phrases is given
  • The ranking technique can depend on physical proximity
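Phrase and proximity matching can be sketched directly over the word sequence of a text. The sketch below assumes whitespace tokenization and measures distance as the difference between word positions, ignoring word order; both choices are assumptions made for illustration, and the function names are hypothetical.

```python
def find_phrase(text, phrase):
    """Return the word positions at which the phrase starts in the text."""
    words = text.lower().split()
    target = phrase.lower().split()
    if not target:
        return []
    n = len(target)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == target]

def within_distance(text, first, second, max_distance):
    """True if the two words occur at most max_distance positions apart."""
    words = text.lower().split()
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in pos_first for j in pos_second)

text = "enhance the retrieval of text in large text retrieval systems"
phrase_hits = find_phrase(text, "text retrieval")           # [7]
near = within_distance(text, "retrieval", "text", 2)        # True
far = within_distance(text, "enhance", "systems", 3)        # False
```

The proximity query accepts the occurrence "retrieval of text", which the phrase query rejects; this is the sense in which proximity is a relaxed version of the phrase.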

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators.

A boolean query has a syntax composed of:

  • atoms: basic queries that retrieve documents
  • boolean operators: work on their operands (which are sets of documents) and deliver sets of documents

This scheme is, in general, compositional: operators can be composed over the results of other operators.

Boolean Queries

A query syntax tree is naturally defined.

Consider the following example of a query syntax tree:

[Figure: query syntax tree with AND at the root; its operands are the word translation and an OR node over the words syntax and syntactic]

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic.

Boolean Queries

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

  • e1 OR e2: selects all documents which satisfy e1 or e2
  • e1 AND e2: selects all documents which satisfy both e1 and e2
  • e1 BUT e2: selects all documents which satisfy e1 but not e2
  • NOT e2: selects all documents which do not contain e2
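Because operands and results are both sets of documents, a query syntax tree can be evaluated recursively with ordinary set operations. The sketch below is one possible encoding, not a prescribed one: the tuple representation of the tree, the `index` mapping from words to document ids, and the `all_docs` set are all assumptions made for illustration.

```python
def evaluate(node, index, all_docs):
    """Evaluate a query syntax tree and return the set of matching doc ids.

    A node is either a word (an atom) or a tuple (operator, operand, ...).
    """
    if isinstance(node, str):
        return set(index.get(node, set()))
    op, *operands = node
    results = [evaluate(child, index, all_docs) for child in operands]
    if op == "OR":
        return results[0] | results[1]
    if op == "AND":
        return results[0] & results[1]
    if op == "BUT":
        return results[0] - results[1]
    if op == "NOT":
        # The complement is taken against the whole collection.
        return all_docs - results[0]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical inverted index: word -> ids of the documents containing it.
index = {
    "translation": {1, 2, 3},
    "syntax": {2, 4},
    "syntactic": {3, 5},
}
all_docs = {1, 2, 3, 4, 5, 6}

# The example tree: translation AND (syntax OR syntactic).
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
matches = evaluate(query, index, all_docs)  # {2, 3}
```

Note that NOT is the only operator that needs `all_docs`, which reflects why it is rarely used alone: its result is the rest of the collection.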

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided.

A document either satisfies the boolean query or it does not.

This is quite a limitation, because it does not allow for partial matching between a document and a user query.

To overcome this limitation, the condition for retrieval must be relaxed.

For instance, a document which partially satisfies an AND condition might be retrieved.

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection.

Boolean Queries

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed, so that they retrieve more documents.

Documents are ranked higher when they have a larger number of elements in common with the query.

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments which match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

  • Words: a string which must be a word in the text
  • Prefixes: a string which must form the beginning of a text word
  • Suffixes: a string which must form the termination of a text word
  • Substrings: a string which can appear within a text word
  • Ranges: a pair of strings which matches any word which lexicographically lies between them
  • Allowing errors: a word together with an error threshold
  • Regular expressions: a rather general pattern built up from simple strings
  • Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
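Most of these pattern types can be illustrated against a small vocabulary with the Python standard library. This is a sketch only: the vocabulary is invented, and using Levenshtein edit distance for the "allowing errors" case is an assumption, since the error model is not fixed above.

```python
import re

def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

# Invented vocabulary, for illustration only.
vocabulary = ["retrieval", "retrieve", "text", "context", "query", "queries"]

prefix_matches = [w for w in vocabulary if w.startswith("retriev")]
suffix_matches = [w for w in vocabulary if w.endswith("text")]
substring_matches = [w for w in vocabulary if "tex" in w]
range_matches = [w for w in vocabulary if "q" <= w <= "r"]
# Allowing errors: the misspelled word plus an error threshold of 2.
error_matches = [w for w in vocabulary if levenshtein(w, "retreival") <= 2]
regex_matches = [w for w in vocabulary if re.fullmatch(r"quer(y|ies)", w)]
```

Scanning the vocabulary linearly, as done here, keeps the example short; it says nothing about how a real system would index these patterns.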

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

Then the documents containing them are penalized in the ranking computation.

Under this scheme, we have completely eliminated any reference to boolean operations and entered the field of natural language queries.

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details are given in Chapter 13, on Structured Text Retrieval.

Structural Queries

The three main types of structures are:

  a) form-like fixed structure
  b) hypertext structure
  c) hierarchical structure

[Figure: schematic views of the three types of structure, labeled a), b), and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them, we can mention proposals by leading relational database vendors such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure is the page of a book and its schematic view.

[Figure: a book page (Chapter 6, with the sections 6.1 Keyword Based and 6.3 Structured Queries) and its schematic view as a tree with chapter, section, title, and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented:

[Figure: parsed query tree built from the nodes IN, figure, WITH, section, title, and “structural”]

This parsed query returns the image below.

[Figure: the image returned by the query]

Query Languages: Query Protocols

Query Protocols

Sometimes, query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

  • Z39.50
  • Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

  • Common Command Language (CCL)
  • Compact Disk Read only Data exchange (CD-RDx)
  • Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

    Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

    where paper contains "retrieval" or like "info%" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most Web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

  • query occurrences
  • term occurrences
  • query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (as a percentage of queries) for these last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
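Under a Zipf's law, the frequency of the word of rank r is proportional to 1 / r^α, so log-frequency is linear in log-rank with slope -α. One simple way to estimate α from a query log, sketched below, is a least-squares fit on the log-log values. This estimator is an assumption chosen for brevity (the studies cited above may have used other fitting methods), and the function names are hypothetical.

```python
import math
from collections import Counter

def word_frequencies(queries):
    """Count word occurrences over all queries of a log."""
    return Counter(word for q in queries for word in q.lower().split())

def zipf_alpha(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic frequencies that follow f(rank) = 1000 / rank**1.2 exactly,
# used only to check that the fit recovers the exponent.
synthetic = [1000 / rank ** 1.2 for rank in range(1, 201)]
alpha = zipf_alpha(synthetic)
```

For a real log, one would pass `word_frequencies(queries).values()` to `zipf_alpha`; the list needs at least two distinct ranks for the fit to be defined.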

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary; log-log plot with document frequency on the x-axis (1e-06 to 1) and query frequency on the y-axis (1e-07 to 1)]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later, in the section about Query Topic).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine, with the states Home Page, Advanced search, Directory, Results, Selected, Refining, and End session, plus the URL and Web nodes; edges are labeled with transition probabilities, which the legend groups into the bands > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)
  • Informational: the intent is to acquire some information (39% survey, 48% query log)
  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extended and somewhat changed Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replaced Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • Anchor-text distribution of the words in the queries
  • Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified into more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or
  • the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%):

Rank   Topic                                      1997   2001
1      Commerce, travel, employment, or economy   13.3   24.7
2      People, places, or things                  6.7    19.7
3      Non-English or unknown                     4.1    11.3
4      Computers or Internet                      12.5   9.6
5      Sex or pornography                         16.8   8.5
6      Health or sciences                         9.5    7.5
7      Entertainment or recreation                19.9   6.6
8      Education or humanities                    5.6    4.5
9      Society, culture, ethnicity, or religion   5.4    3.9
10     Government                                 3.4    2.0
11     Performing or fine arts                    5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem, because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the search results of a query through the classifier and then used a voting algorithm to classify the query.

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit
  • the user might have more than one goal in the session

Hence, it is good to distinguish:

  • time-based sessions: queries of a user in the same session
  • missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
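Session splitting by maximum inactivity time can be sketched as a single pass over one user's query timestamps. The 30-minute default below is only an assumption picked from within the five-to-sixty-minute range mentioned above, and the function name and the sample log are hypothetical.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, max_inactivity=timedelta(minutes=30)):
    """Group one user's query timestamps into sessions.

    A new session starts whenever the gap since the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Invented timestamps of one user's queries, for illustration only.
log = [
    datetime(2007, 5, 1, 9, 0),
    datetime(2007, 5, 1, 9, 10),
    datetime(2007, 5, 1, 9, 25),
    datetime(2007, 5, 1, 11, 0),
    datetime(2007, 5, 1, 11, 5),
]
sessions = split_sessions(log)                          # 2 sessions
strict = split_sessions(log, timedelta(minutes=5))      # 4 sessions
```

The same log yields two sessions with a 30-minute threshold and four with a 5-minute one, which shows how sensitive the result is to the threshold. Detecting missions would additionally require comparing the queries themselves, which this sketch does not attempt.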

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set
  • Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
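The idea behind Clarity Score can be sketched as the Kullback-Leibler divergence between a language model of the top retrieved documents and a language model of the collection. The sketch below is a deliberate simplification, not the published method: it assumes unsmoothed maximum likelihood unigram models and weights all top documents equally, and it assumes the top documents are part of the collection so that every word has a nonzero collection probability. All names and the toy collection are hypothetical.

```python
import math
from collections import Counter

def language_model(documents):
    """Unsmoothed maximum likelihood unigram model over the documents."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def clarity(top_documents, collection):
    """KL divergence (in bits) of the top-document model from the collection model."""
    p_top = language_model(top_documents)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[word]) for word, p in p_top.items())

# Toy collection, for illustration only.
collection = [
    "jaguar car engine",
    "jaguar cat jungle",
    "car engine repair",
    "cat jungle habitat",
]
# Topically cohesive top results (both about cars).
focused = clarity([collection[0], collection[2]], collection)
# Top results mixing the two senses of "jaguar".
mixed = clarity([collection[0], collection[1]], collection)
```

In this toy setting `focused` is larger than `mixed`: the cohesive result list diverges more from the collection distribution, which matches the intuition that an unambiguous query is clearer.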

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and
  • a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or
  • the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

  • $P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$
  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
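The SCS formula can be computed directly from term counts. The sketch below assumes whitespace tokenization, takes the maximum likelihood estimate as the count of the term in the query divided by the query length, and assumes every query term occurs in the collection (otherwise the collection probability would be zero). The function name and the toy collection are hypothetical.

```python
import math
from collections import Counter

def simplified_clarity_score(query, collection):
    """SCS(q): sum over query terms of P_ml * log2(P_ml / P_coll)."""
    query_terms = query.lower().split()
    query_counts = Counter(query_terms)
    coll_counts = Counter(w for doc in collection for w in doc.lower().split())
    coll_total = sum(coll_counts.values())
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)           # P_ml(k_i | q)
        p_coll = coll_counts[term] / coll_total   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Toy collection with 8 term occurrences: text 4, retrieval 2, mining 1, query 1.
collection = ["text text retrieval", "text mining", "query text retrieval"]

scs_common = simplified_clarity_score("text", collection)          # 1.0
scs_rare = simplified_clarity_score("mining", collection)          # 3.0
scs_two = simplified_clarity_score("text retrieval", collection)   # 0.5
```

The rare term "mining" scores higher than the frequent term "text", reflecting that a query made of terms that are rare in the collection is more specific.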

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
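The AvPMI formula can be sketched as an average over all pairs of distinct query terms. As a simplifying assumption for this sketch, all three probabilities are estimated as fractions of documents (the fraction containing one term, or both terms); this is one possible estimate and may differ from the one used in the literature. The sketch also assumes the query has at least two distinct terms and that every pair co-occurs in at least one document. All names are hypothetical.

```python
import math
from itertools import combinations

def averaged_pmi(query, collection):
    """Average of log2(P(a, b) / (P(a) * P(b))) over pairs of query terms."""
    docs = [set(doc.lower().split()) for doc in collection]
    n_docs = len(docs)

    def prob(*terms):
        # Fraction of documents containing all the given terms (assumption).
        return sum(all(t in d for t in terms) for d in docs) / n_docs

    pairs = list(combinations(sorted(set(query.lower().split())), 2))
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)

# Terms that always occur together: strongly associated.
associated = averaged_pmi(
    "text retrieval",
    ["text retrieval", "text retrieval", "query log", "query mining"],
)  # 1.0
# Terms that co-occur exactly as often as independence predicts.
independent = averaged_pmi("a b", ["a b", "a", "b", "c"])  # 0.0
```

A positive value indicates that the query terms co-occur more often than expected under independence, and a value of zero indicates no association.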

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • the technique proposed by He et al. requires clustering
  • the technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

However, neither technique is efficient enough to be useful for large collections.

  • Queries Languages amp Properties
  • Query Languages
  • Query Languages
  • Query Languages
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Word Queries
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Pattern Matching
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Structural Queries
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Query Protocols
  • Query Protocols
  • Query Protocols
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms

Query Languages

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 3

Query LanguagesWe cover now the different kind of queries normallyposed to text retrieval systems

This is in part dependent on the retrieval model thesystem adopts

That is a full-text system will not answer the same kind of queriesas those answered by a system based on keyword ranking

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 4

Query LanguagesThere is a difference between information retrievaland data retrieval

Languages for information retrieval allow the answer tobe ranked

For query languages not aimed at information retrievalthe concept of ranking cannot be easily defined

We consider these languages as languages for data retrieval

Some query languages are not intended for final users

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 5

Query LanguagesThere are a number of techniques to enhance theusefulness of the queries

Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus

Some words which are very frequent and do not carrymeaning (called stopwords) may be removed

We refer to words that can be used to match queryterms as keywords

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6

Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts

The retrieval unit is the basic element which can beretrieved as an answer to a query

We call the retrieval units simply documents even ifthis reference can be used with different meanings

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7

Query LanguagesKeyword Based Querying

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8

Keyword Based QueryingA query is the formulation of a user information need

Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking

However a query can also be a more complexcombination of operations involving several words

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9

Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word

Some models are also able to see the internal divisionof words into letters

In this case the alphabet is split into letters and separators

A word is a sequence of letters surrounded by separators

The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10

The result of word queries is the set of documents containing at least one of the words of the query

Further, the resulting documents are ranked according to the degree of similarity with respect to the query

To support ranking, two statistics on word occurrences inside texts are commonly used

The first is called term frequency and counts the number of times a word appears inside a document

The second is called inverse document frequency and is derived from the number of documents in which a word appears: the fewer documents contain the word, the higher its value
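A small sketch of the two statistics may help. It assumes one common formulation, idf = log(N / df), where N is the number of documents and df the number of documents containing the word; other variants exist. The toy collection and function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(doc_tokens):
    """Count how many times each word appears inside one document."""
    return Counter(doc_tokens)


def inverse_document_frequency(word, docs):
    """log(N / df): rarer words (small df) receive a higher value."""
    df = sum(1 for doc in docs if word in doc)
    if df == 0:
        return 0.0  # assumption: words absent from the collection get weight 0
    return math.log(len(docs) / df)


# Toy collection, already split into words.
docs = [
    ["web", "search", "engine", "web"],
    ["web", "query", "log"],
    ["boolean", "query"],
]
tf = term_frequencies(docs[0])
print(tf["web"], inverse_document_frequency("web", docs))
print(tf["boolean"], inverse_document_frequency("boolean", docs))
```

A ranking function would typically combine both values, for instance by multiplying them per query word.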

The other possibility of interpreting queries, popularized by Web search engines, is the conjunctive form

In this case, a document matches a query only if it contains all the words in the query

This is useful when the number of results for one single word is too large

Additionally, it may be required that the exact positions in which a word occurs in the text should be provided

This might be useful for highlighting word occurrences in snippets, for instance, during the display of results
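The conjunctive interpretation, together with the reporting of word positions, might be sketched as below. The function name and the token-list representation of a document are assumptions made for the example.

```python
def conjunctive_match(query_words, doc_tokens):
    """Return {word: [positions]} if every query word occurs, else None."""
    positions = {word: [] for word in query_words}
    for i, token in enumerate(doc_tokens):
        if token in positions:
            positions[token].append(i)
    # Conjunctive form: the document matches only if all words were found.
    if all(positions[word] for word in query_words):
        return positions
    return None


doc = "web search engines rank web pages".split()
print(conjunctive_match(["web", "rank"], doc))
print(conjunctive_match(["web", "boolean"], doc))
```

The returned positions are what a result page would need in order to highlight occurrences in a snippet.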

Context Queries

Many systems complement queries with the ability to search words in a given context

Words which appear near each other may signal higher likelihood of relevance than if they appear apart

We may want to form phrases of words or find words which are proximal in the text

Phrase

Is a sequence of single-word queries

An occurrence of the phrase is a sequence of words

Can be ranked in a fashion somewhat analogous to single words

Proximity

Is a more relaxed version of the phrase query

A maximum allowed distance between single words or phrases is given

The ranking technique can depend on physical proximity
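Both kinds of context query can be illustrated on a tokenized document. In this sketch the distance is measured in words, which is an assumption (systems may also count characters), and the function names are hypothetical.

```python
def phrase_positions(phrase, tokens):
    """Start positions where the words of the phrase appear consecutively."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def within_proximity(word_a, word_b, tokens, max_distance):
    """True if some occurrence of word_a lies within max_distance words of word_b."""
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


tokens = "enhance the retrieval of structured text in retrieval systems".split()
print(phrase_positions(["structured", "text"], tokens))
print(within_proximity("retrieval", "text", tokens, max_distance=2))
```

A phrase query is the special case in which the words must be adjacent and in order, which is why proximity is described as its relaxed version.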

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators

A boolean query has a syntax composed of:

atoms: basic queries that retrieve documents

boolean operators: work on their operands (which are sets of documents) and deliver sets of documents

This scheme is, in general, compositional: operators can be composed over the results of other operators

A query syntax tree is naturally defined

Consider the example of a query syntax tree below

    AND
       translation
       OR
          syntax
          syntactic

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

e1 OR e2: selects all documents which satisfy e1 or e2

e1 AND e2: selects all documents which satisfy both e1 and e2

e1 BUT e2: selects all documents which satisfy e1 but not e2

NOT e2: selects all documents which do not contain e2
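Since operators take sets of documents and deliver sets of documents, a query syntax tree can be evaluated recursively with ordinary set operations. The sketch below assumes a hypothetical representation: an atom is a word looked up in an inverted index, and an operator node is a tuple such as `("AND", left, right)`.

```python
def evaluate(node, index, all_docs):
    """Evaluate a query tree and return the set of matching document ids."""
    if isinstance(node, str):
        return index.get(node, set())  # atom: documents containing the word
    op, *args = node
    operands = [evaluate(arg, index, all_docs) for arg in args]
    if op == "OR":
        return operands[0] | operands[1]
    if op == "AND":
        return operands[0] & operands[1]
    if op == "BUT":
        return operands[0] - operands[1]
    if op == "NOT":
        return all_docs - operands[0]  # complement within the collection
    raise ValueError(f"unknown operator: {op}")


# Toy inverted index: word -> ids of the documents containing it.
index = {"translation": {1, 2, 3}, "syntax": {2, 4}, "syntactic": {3, 5}}
all_docs = {1, 2, 3, 4, 5}

query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(query, index, all_docs))
```

The example query is the tree for translation AND (syntax OR syntactic). Note that NOT needs the whole collection as an argument, which hints at why it is rarely used alone.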

With classic boolean systems, no ranking of the retrieved documents is provided

A document either satisfies the boolean query or it does not

This is quite a limitation because it does not allow for partial matching between a document and a user query

To overcome this limitation, the condition for retrieval must be relaxed

For instance, a document which partially satisfies an AND condition might be retrieved

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection

A fuzzy-boolean set of operators has been proposed

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents

The documents are ranked higher when they have a larger number of elements in common with the query
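The ranking idea can be approximated by counting how many query words each document contains. This is only a rough sketch of the principle, not the specific fuzzy-boolean operators proposed in the literature; the function name and toy data are hypothetical.

```python
def fuzzy_rank(query_words, docs):
    """Rank documents by the number of distinct query words they contain."""
    scored = []
    for doc_id, tokens in docs.items():
        matches = len(set(query_words) & set(tokens))
        if matches > 0:  # relaxed AND: a partial match is still retrieved
            scored.append((matches, doc_id))
    # More elements in common first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


docs = {
    "d1": ["syntax", "translation"],
    "d2": ["translation"],
    "d3": ["parsing"],
}
print(fuzzy_rank(["syntax", "translation"], docs))
```

A strict AND would return only d1, while this relaxed form also retrieves d2 and ranks it lower.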

Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment

Those segments satisfying the pattern specifications are said to match the pattern

We can search for documents containing segments which match a given search pattern

Each system allows specifying some types of patterns

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up by simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
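Several of these pattern types can be tried out on a small vocabulary. The sketch below uses only the Python standard library. It assumes that "allowing errors" means a bound on the Levenshtein edit distance, which is one common choice, and the vocabulary is made up.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance, computed with one rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


words = ["retrieval", "retrieve", "query", "queries", "ranking", "rank"]

prefix = [w for w in words if w.startswith("retr")]
suffix = [w for w in words if w.endswith("ing")]
substring = [w for w in words if "ue" in w]
in_range = [w for w in words if "q" <= w <= "r"]            # lexicographic range
approx = [w for w in words if edit_distance(w, "qeury") <= 2]  # error threshold 2
regex = [w for w in words if re.fullmatch(r"quer(y|ies)", w)]

for name, hits in [("prefix", prefix), ("suffix", suffix),
                   ("substring", substring), ("range", in_range),
                   ("errors", approx), ("regex", regex)]:
    print(name, hits)
```

The list comprehensions scan the whole vocabulary for clarity. A real system would rely on index structures to avoid that scan.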

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred

In this case, a query becomes simply an enumeration of words and context queries

The negation can be handled by letting the user express that some words are not desired

Then the documents containing them are penalized in the ranking computation

Under this scheme, we have completely eliminated any reference to boolean operations and entered into the field of natural language queries

Structural Queries

The text collections tend to have some structure built into them

The standardization of languages to represent structured texts has pushed forward in this direction

Mixing contents and structure in queries allows posing very powerful queries

Queries can be expressed using containment, proximity, or other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

[Figure: schematic views of the three structures a), b), and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive

The documents had a fixed set of fields and each field had some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

Fields were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field
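Restricting a pattern to one field can be sketched with documents stored as flat dictionaries, which mirrors the rule that fields neither nest nor overlap. The field names and the function `search_field` are hypothetical.

```python
def search_field(documents, field, word):
    """Return ids of documents whose given field contains the word."""
    hits = []
    for doc_id, fields in documents.items():
        text = fields.get(field)  # a field may be absent from a document
        if text is not None and word in text.lower().split():
            hits.append(doc_id)
    return hits


# Toy collection of mail-like documents with a fixed set of fields.
documents = {
    "m1": {"sender": "alice", "subject": "Query languages", "body": "notes on ranking"},
    "m2": {"sender": "bob", "body": "query logs attached"},
}
print(search_field(documents, "subject", "query"))
print(search_field(documents, "body", "query"))
```

The same word yields different answers depending on the field, and a document lacking the field simply cannot match.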

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc

This idea leads naturally to the relational model, each field corresponding to a column in the database table

There are several proposals that extend SQL to allow full-text retrieval

Among them we can mention proposals by the leading relational database vendors such as Oracle and Sybase, as well as SFQL

Hypertexts probably represent the opposite trend with respect to structuring power

Retrieval from hypertext began as a merely navigational activity

That is, the user had to manually traverse the hypertext nodes following links to search what he/she wanted

Some query tools allow querying hypertext based on their content and their structure

An intermediate model which lies between fixed structure and hypertext is the hierarchical structure

An example of a hierarchical structure: the page of a book and its schematic view

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) and its schematic view as a tree: a chapter node containing section nodes, which in turn contain title and figure nodes]

An example of a query to the hierarchical structure presented:

    IN
       figure
       WITH
          section
          WITH
             title
             "structural"

This parsed query returns the image below

[Figure: the image returned by the query]

Query Protocols

Sometimes, query languages are used by applications to query text databases

Because they are not intended for human use, we refer to them as protocols rather than languages

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

In the CD-ROM publishing arena there are several proposals for query protocols

The main goal of these protocols is to provide disc interchangeability

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture

The language does not define any specific formatting or markup

For example, a query in SFQL is:

    Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition

For example:

    where paper contains "retrieval" or like "%info%" and date > 1/1/98

Query Properties

Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs

People have different search needs at different times and in different contexts

A number of researchers have attempted to classify and tally the types of information needs

Most web search engines record information about the queries

This information includes the query itself, the time, and the IP address

Some systems also record which search results were clicked on for a given query

These logs are a valuable resource for understanding the kinds of information needs that users have

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length

Jansen & Spink claim a trend of a decreasing percentage of one-word queries

Jansen et al found that the percentage of three word queries increased from 28% in 1998 to 49% in 2002

In May 2005, Jansen et al conducted a study using 15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al

The table below shows the distribution of query lengths for these two last references

Length   Dogpile (2005), %   Yahoo (2007), %
1        18                  25
2        32                  35
3        25                  23
4        13                  11
5        6                   4
6        3                   1
>7       3                   1

Queries, as words in a text, follow a biased distribution

In fact, the frequency of query words follows a Zipf's law with parameter α

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences

However, this is less biased than Web text, where α is closer to 2

The standard correlation between the frequency of a word in Web pages and in queries also varies

These values range from 0.15 to 0.42
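For reference, Zipf's law is usually written as a power law over ranks. The symbols f and r below are not taken from the slides; they follow the common convention in which r is the rank of a word when words are sorted by decreasing frequency:

\[ f(r) \propto \frac{1}{r^{\alpha}} \]

A larger α means the frequency mass is concentrated on fewer words, which is the sense in which Web text (α closer to 2) is more biased than query vocabularies (α between 0.6 and 1.4).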

This implies that what people search is different from what people publish in the Web

[Figure: relative query frequency vs relative document frequency of each word in a vocabulary. Both axes are logarithmic: document frequency (horizontal, 1e-06 to 1) and query frequency (vertical, 1e-07 to 1)]

Some logs also register the number of answer pages seen and the pages selected after a search

Many people refine the query by adding and removing words, but most of them see very few answer pages

The table below shows query statistics for four different search engines

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query)

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics)

User Search Behavior

State diagram of user behavior in a small search engine

[Figure: state diagram whose edges are labeled with transition probabilities, grouped in the legend as >0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01. Node labels: Home Page, Advanced search, Directory, Results, Selected, Refining, End session, URL, Web]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted to help the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus, they replace Broder's transactions category with a broader category of resources

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be classified in more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Spink & Jansen et al have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                        1997   2001
1      Commerce, travel, employment, or economy     13.3   24.7
2      People, places, or things                    6.7    19.7
3      Non-English or unknown                       4.1    11.3
4      Computers or Internet                        12.5   9.6
5      Sex or pornography                           16.8   8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                  19.9   6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Shen et al mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: F-score of about 45 on 63 categories

The table below shows the results for five queries

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Broder et al presented a method for classifying short, rare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They ran the search results of a query through the classifier and then used a voting algorithm to classify the query

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed limits of time to define sessions, but this definition presents two problems:

the session might be longer

the goals of the user in the session might be more than one

Hence, it is good to distinguish:

time based sessions: queries of a user in a same session

missions: sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are sequences of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
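Splitting a user's query log by a maximum inactivity time can be sketched in a few lines. The sketch assumes timestamps in seconds, already sorted and belonging to a single user; the function name and the sample values are hypothetical.

```python
def split_sessions(timestamps, max_inactivity):
    """Group sorted timestamps into sessions, cutting at gaps > max_inactivity."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(t)   # still within the current session
        else:
            sessions.append([t])     # inactivity gap exceeded: new session
    return sessions


query_times = [0, 60, 200, 4000, 4100]  # seconds since the first query
print(split_sessions(query_times, max_inactivity=30 * 60))
print(split_sessions(query_times, max_inactivity=100))
```

The choice of threshold changes the result, which is consistent with different authors reporting different values. Detecting missions would additionally require comparing the queries themselves.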

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one word queries are simpler than phrase queries

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top ranked results of an unambiguous query will be topically cohesive

Term distribution of an ambiguous query is assumed to be more similar to the collection distribution
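The intuition can be made concrete with a simplified computation. The sketch below measures the difference between the two language models as a Kullback-Leibler divergence in bits, using unsmoothed maximum likelihood estimates. Both choices are assumptions made for illustration; the published Clarity Score uses a more elaborate query language model. The toy collection is made up.

```python
import math
from collections import Counter


def language_model(docs):
    """Unsmoothed unigram model: relative frequency of each term."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity(top_docs, collection):
    """KL divergence between the top-document model and the collection model.

    Assumes top_docs is a subset of collection, so every term has a
    non-zero collection probability.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())


collection = [
    ["jaguar", "car"],
    ["jaguar", "cat"],
    ["stock", "market"],
    ["car", "market"],
]
print(clarity(collection[:1], collection))  # cohesive top results
print(clarity(collection, collection))      # top results mirror the collection
```

A result set that looks like the whole collection scores zero, matching the assumption about ambiguous queries.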

Yom-Tov et al compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Query Feedback frames query prediction as a communication channel problem

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L, a new query Q′ is generated; a second ranking L′ is retrieved with Q′

The overlap between L and L′ is used as prediction score

Hauff et al provide an evaluation of these techniques and propose a new technique named Improved Clarity

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they either take into account:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al proposed measures based on the inverse document frequency and the collection frequency

Queries with low frequency terms are predicted to achieve a better performance

He et al evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies:

\[ SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right) \]

where

\(P_{ml}(k_i|q)\) is the maximum likelihood estimator of term \(k_i\) given query \(q\)

\(P_{coll}(k_i)\) is the term count of \(k_i\) in the collection divided by the total number of terms in the collection
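The SCS formula translates almost directly into code. The sketch assumes the maximum likelihood estimate is the count of the term in the query divided by the query length, and that every query term occurs in the collection. The collection counts are invented for the example.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_counts):
    """SCS(q): sum over query terms of P_ml * log2(P_ml / P_coll)."""
    total_terms = sum(collection_counts.values())
    score = 0.0
    for term, count in Counter(query_terms).items():
        p_ml = count / len(query_terms)                  # estimate from the query
        p_coll = collection_counts[term] / total_terms   # term must be in the collection
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical term counts for a collection of 1000 term occurrences.
collection_counts = {"the": 500, "query": 30, "retrieval": 10, "other": 460}
print(simplified_clarity_score(["query", "retrieval"], collection_counts))
print(simplified_clarity_score(["the"], collection_counts))
```

The query made of rare terms obtains the higher score, in line with the prediction that queries with low frequency terms perform better.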

Averaged PMI measures the average mutual information of two query terms in the collection:

\[ AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right) \]

where \(P_{coll}(k_i, k_\ell)\) is the probability that terms \(k_i\) and \(k_\ell\) occur in the same document
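A sketch of the AvPMI computation follows. It estimates all probabilities at the document level (the fraction of documents containing the term or the pair), which is an assumption made here so that the joint and single-term estimates are comparable. It also assumes at least two distinct query terms and that every pair co-occurs in some document.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average, over all pairs of query terms, of their pointwise mutual information."""
    doc_sets = [set(doc) for doc in docs]

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(1 for d in doc_sets if all(t in d for t in terms)) / len(doc_sets)

    pairs = list(combinations(sorted(set(query_terms)), 2))
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)


docs = [["a", "b"], ["a", "b"], ["a"], ["c"]]
print(averaged_pmi(["a", "b"], docs))
```

A positive value indicates that the query terms co-occur more often than independence would suggest.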

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive:

The technique proposed by He et al requires clustering

The technique proposed by Zhao et al requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections

Query LanguagesWe cover now the different kind of queries normallyposed to text retrieval systems

This is in part dependent on the retrieval model thesystem adopts

That is a full-text system will not answer the same kind of queriesas those answered by a system based on keyword ranking

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 4

Query LanguagesThere is a difference between information retrievaland data retrieval

Languages for information retrieval allow the answer tobe ranked

For query languages not aimed at information retrievalthe concept of ranking cannot be easily defined

We consider these languages as languages for data retrieval

Some query languages are not intended for final users

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 5

Query LanguagesThere are a number of techniques to enhance theusefulness of the queries

Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus

Some words which are very frequent and do not carrymeaning (called stopwords) may be removed

We refer to words that can be used to match queryterms as keywords

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6

Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts

The retrieval unit is the basic element which can beretrieved as an answer to a query

We call the retrieval units simply documents even ifthis reference can be used with different meanings

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7

Query LanguagesKeyword Based Querying

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8

Keyword Based QueryingA query is the formulation of a user information need

Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking

However a query can also be a more complexcombination of operations involving several words

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9

Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word

Some models are also able to see the internal divisionof words into letters

In this case the alphabet is split into letters and separators

A word is a sequence of letters surrounded by separators

The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10

Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query

Further the resulting documents are ranked accordingto the degree of similarity with respect to the query

To support ranking two common statistics on wordoccurrences inside texts are commonly used

The first is called term frequency and counts the number oftimes a word appears inside a document

The second is called inverse document frequency and countsthe number of documents in which a word appears

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11

Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form

In this case a document matches a query only if it contains allthe words in the query

This is useful when the number of results for one singleword is too large

Additionally it may be required that the exact positionsin which a word occurs in the text should be provided

This might be useful for highlighting word occurrencesin snippets for instance during the display of results

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12

Context QueriesMany systems complement queries with the ability tosearch words in a given context

Words which appear near each other may signal higherlikelihood of relevance than if they appear apart

We may want to form phrases of words or find wordswhich are proximal in the text

Phrase

Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words

Proximity

Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators.

A boolean query has a syntax composed of:

atoms: basic queries that retrieve documents

boolean operators: work on their operands (which are sets of documents) and deliver sets of documents

This scheme is, in general, compositional: operators can be composed over the results of other operators.

Boolean Queries

A query syntax tree is naturally defined.

Consider the example of a query syntax tree below:

translation AND (syntax OR syntactic)

[Figure: syntax tree with root AND, whose operands are the word translation and an OR node over the words syntax and syntactic]

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic.

Boolean Queries

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

e1 OR e2: the query selects all documents which satisfy e1 or e2.

e1 AND e2: selects all documents which satisfy both e1 and e2.

e1 BUT e2: selects all documents which satisfy e1 but not e2.

NOT e2: the query selects all documents which do not contain e2.
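
Because the operands are sets of documents, the operators map directly onto set operations. The sketch below evaluates a query syntax tree written as nested tuples; the tuple encoding, the toy index, and the document identifiers are assumptions for illustration.

```python
def evaluate(node, index, all_docs):
    """Evaluate a query tree given as nested tuples; atoms are plain words."""
    if isinstance(node, str):
        return set(index.get(node, set()))
    op, *args = node
    if op == "NOT":
        # The complement is taken against the whole collection.
        return all_docs - evaluate(args[0], index, all_docs)
    left = evaluate(args[0], index, all_docs)
    right = evaluate(args[1], index, all_docs)
    if op == "OR":
        return left | right
    if op == "AND":
        return left & right
    if op == "BUT":
        return left - right
    raise ValueError(f"unknown operator: {op}")

# Hypothetical inverted index: word -> set of document ids.
index = {
    "translation": {1, 2, 4},
    "syntax": {1, 3},
    "syntactic": {4, 5},
}
all_docs = {1, 2, 3, 4, 5}

query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query, index, all_docs)
print(sorted(result))
```

The recursion mirrors the compositional nature of the scheme: every operator receives the document sets produced by its sub-expressions and returns a new set.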

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided.

A document either satisfies the boolean query or it does not.

This is quite a limitation because it does not allow for partial matching between a document and a user query.

To overcome this limitation, the condition for retrieval must be relaxed.

For instance, a document which partially satisfies an AND condition might be retrieved.

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection.

Boolean Queries

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed, so that they retrieve more documents.

The documents are ranked higher when they have a larger number of elements in common with the query.
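
One simple way to read this relaxation is to retrieve any document that shares at least one word with the query and to rank by the number of shared words. This is only a sketch of the idea under that assumption; it is not the specific fuzzy-boolean operator set the slide refers to.

```python
def relaxed_and(query_words, collection):
    """Rank documents by how many of the query words they contain (at least one)."""
    scored = []
    for doc_id, text in collection.items():
        words = set(text.lower().split())
        common = sum(1 for w in query_words if w in words)
        if common > 0:
            scored.append((common, doc_id))
    # Most words in common first; ties are broken by document id for determinism.
    return [doc_id for common, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

docs = {
    1: "syntax directed translation",
    2: "machine translation",
    3: "syntax trees",
    4: "cooking recipes",
}
ranking = relaxed_and(["syntax", "translation"], docs)
print(ranking)
```

A strict AND would return only document 1; the relaxed version also returns the partial matches, ranked below it.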

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments which match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
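
Several of these pattern types can be sketched as filters over a vocabulary. The vocabulary and function names are hypothetical, and the error threshold is interpreted here as a maximum Levenshtein (edit) distance, which is one common choice rather than the only one.

```python
import re

# Hypothetical vocabulary of text words.
vocabulary = ["retrieval", "retrieve", "retrieved", "return", "survival", "text", "texts"]

def prefix(p):
    return [w for w in vocabulary if w.startswith(p)]

def suffix(s):
    return [w for w in vocabulary if w.endswith(s)]

def substring(s):
    return [w for w in vocabulary if s in w]

def in_range(low, high):
    """Words lying lexicographically between `low` and `high`."""
    return [w for w in vocabulary if low <= w <= high]

def regex(pattern):
    """Words matched in full by a regular expression."""
    return [w for w in vocabulary if re.fullmatch(pattern, w)]

def edit_distance(a, b):
    """Levenshtein distance computed by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def allowing_errors(word, k):
    """Words within edit distance k of `word`."""
    return [w for w in vocabulary if edit_distance(word, w) <= k]

print(prefix("retriev"), allowing_errors("retreive", 2))
```

In a real system these filters would run against the index vocabulary rather than a Python list, and the matching words would then be looked up as ordinary word queries.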

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

Then the documents containing them are penalized in the ranking computation.

Under this scheme, we have completely eliminated any reference to boolean operations and entered the field of natural language queries.

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details are given in Chapter 13, on Structured Text Retrieval.

Structural Queries

The three main types of structures:

a) form-like fixed structure
b) hypertext structure
c) hierarchical structure

[Figure: schematic views of the three structure types, labeled a), b) and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.
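
Restricting a basic pattern to one field can be sketched over records stored as dictionaries. The field names and records are made up for illustration; note that a record may lack a field altogether, as the slide points out.

```python
# Hypothetical records with a fixed set of possible fields.
records = [
    {"title": "Text retrieval systems", "author": "Smith", "body": "Ranking of documents"},
    {"title": "Database design", "body": "Text fields and retrieval of records"},
]

def search_field(records, field, word):
    """Indexes of the records whose `field` contains `word` as a whole word."""
    word = word.lower()
    return [i for i, record in enumerate(records)
            if word in record.get(field, "").lower().split()]

print(search_field(records, "title", "retrieval"))
print(search_field(records, "body", "retrieval"))
```

The same word selects different records depending on the field it is restricted to, and a missing field simply never matches.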

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page headed "Chapter 6", with the text "We cover in this chapter the different kind of" and the section headings "6.1 Keyword Based" and "6.3 Structured Queries", next to its schematic view: a tree with the node labels chapter, section, title and figure over the text fragments "We cover", "Keyword Based" and "Structural"]

Fixed Structure

An example of a query to the hierarchical structure presented:

figure IN (section WITH (title WITH "structural"))

[Figure: parsed query tree with root IN, whose operands are figure and a WITH node; that WITH node has the operands section and a second WITH node over title and "structural"]

This parsed query returns the image below.

[Figure: result image]
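
A query of this kind, figures inside sections whose title contains the word "structural", can be sketched over a small tree. The `Node` class, the helper names, and the toy book are assumptions for illustration and do not reproduce any specific structured query language.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: list = field(default_factory=list)

def descendants(node):
    """Yield every node strictly below `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)

def sections_with_title(root, word):
    """Sections that contain a title node whose text includes `word`."""
    return [n for n in [root, *descendants(root)]
            if n.kind == "section"
            and any(c.kind == "title" and word in c.text.lower().split()
                    for c in descendants(n))]

def figures_in(regions):
    """Figure nodes contained in any of the given regions."""
    return [d for r in regions for d in descendants(r) if d.kind == "figure"]

# Hypothetical book structure.
book = Node("chapter", children=[
    Node("title", "Query Languages"),
    Node("section", children=[Node("title", "Keyword Based Querying")]),
    Node("section", children=[Node("title", "Structural Queries"),
                              Node("figure", "fig-structures")]),
])

result = figures_in(sections_with_title(book, "structural"))
print([n.text for n in result])
```

The inner function plays the role of the WITH restriction (a region that includes something) and the outer one the role of IN (an element included in a region).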

Query Languages: Query Protocols

Query Protocols

Sometimes, query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journalpapers
where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

where paper contains "retrieval"
or like "info" and date > 1198

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these two last references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1
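
A query length distribution like the one reported in these studies can be computed from a log in a few lines. The sample log below is invented, and length is measured as the number of whitespace-separated terms, which is an assumption about how the studies counted.

```python
from collections import Counter

def length_distribution(queries):
    """Percentage of queries per length (in terms) and the mean query length."""
    lengths = [len(q.split()) for q in queries]
    counts = Counter(lengths)
    total = len(lengths)
    distribution = {n: 100.0 * c / total for n, c in sorted(counts.items())}
    return distribution, sum(lengths) / total

# Hypothetical query log.
log = ["weather", "cheap flights", "modern information retrieval", "news", "dog shampoo"]
distribution, mean_length = length_distribution(log)
print(distribution, mean_length)
```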

Characterizing Web Queries

Queries, as words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
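
Under a Zipf's law, the frequency of the word of rank r is proportional to r^(-α), so α can be estimated as the negated slope of a straight line fitted to log frequency against log rank. The least-squares fit below is one simple estimator, chosen here as an assumption; the slides do not say how the reported values were obtained.

```python
import math

def zipf_alpha(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic word frequencies generated with alpha = 1.0.
synthetic = [1000.0 / rank for rank in range(1, 51)]
print(zipf_alpha(synthetic))
```

On real logs the fit is usually restricted to a range of ranks, since the head and the tail of the distribution tend to deviate from a pure power law.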

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

Relative query frequency vs. relative document frequency of each word in a vocabulary:

[Figure: log-log scatter plot; x axis: document frequency, from 1e-06 to 1; y axis: query frequency, from 1e-07 to 1]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later, in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine:

[Figure: state diagram with the states Home Page, Advanced search, Directory, Selected, Results, Refining and End session, together with the labels URL and Web; edges are annotated with transition probabilities, and the legend distinguishes three probability bands: > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%):

Rank   Topic                                        1997   2001
1      Commerce, travel, employment, or economy     13.3   24.7
2      People, places, or things                    6.7    19.7
3      Non-English or unknown                       4.1    11.3
4      Computers or Internet                        12.5   9.6
5      Sex or pornography                           16.8   8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                  19.9   6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                   Top category              Second category
chat rooms              Computers/Internet        Online Community/Chat
lake michigan lodges    Info/Local & Regional     Living/Travel & Vacation
stephen hawking         Info/Science & Tech       Info/Arts & Humanities
dog shampoo             Shopping/Buying Guides    Living/Pets & Animals
text mining             Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed limits of time to define sessions, but this definition presents two problems:

the session might be longer

the goals of the user in the session might be more than one

Hence, it is good to distinguish:

time-based sessions: queries of a user in a same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

A query mission is a sequence of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
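
Splitting one user's queries into sessions with a maximum inactivity time can be sketched as follows. Timestamps are plain minutes and the 30-minute default is an arbitrary value inside the five-to-sixty-minute range mentioned above; both are assumptions for illustration.

```python
def split_sessions(events, max_inactivity_minutes=30):
    """Group (timestamp_in_minutes, query) pairs, sorted by time, into sessions."""
    sessions = []
    for timestamp, query in events:
        # Continue the current session if the gap since its last query is small enough.
        if sessions and timestamp - sessions[-1][-1][0] <= max_inactivity_minutes:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions

# Hypothetical log of a single user.
events = [
    (0, "lake michigan"),
    (3, "lake michigan lodges"),
    (50, "dog shampoo"),
    (55, "dog shampoo reviews"),
    (200, "text mining"),
]
sessions = split_sessions(events)
print([[q for _, q in s] for s in sessions])
```

With a 60-minute threshold the first four queries fall into one session even though they pursue two different goals, which illustrates why sessions and missions are worth distinguishing.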

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
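
The intuition can be illustrated by measuring the Kullback-Leibler divergence between a language model of the top retrieved documents and a language model of the collection. This is a deliberately simplified sketch: both models are unsmoothed maximum likelihood estimates, and the top documents are assumed to belong to the collection. The published Clarity Score estimates the query model differently, so the numbers here are only illustrative.

```python
import math
from collections import Counter

def language_model(texts):
    """Unsmoothed unigram model: relative frequency of each word."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clarity_score(top_docs, collection):
    """KL divergence (in bits) between the top-document and collection models."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    # Assumes every word of top_docs also occurs in the collection.
    return sum(p * math.log2(p / p_coll[w]) for w, p in p_top.items())

# Hypothetical collection of four short documents.
collection = [
    "jaguar car engine speed",
    "jaguar cat jungle habitat",
    "stock market prices today",
    "football match results today",
]
focused = clarity_score(collection[:1], collection)  # cohesive answer set
diffuse = clarity_score(collection, collection)      # answer set mirrors the collection
print(focused, diffuse)
```

A topically cohesive answer set diverges from the collection and scores high, while an answer set that looks like the collection as a whole scores zero.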

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q' is generated, and a second ranking L' is retrieved with Q'.

The overlap between L and L' is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
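
The final step of Query Feedback, the overlap between two ranked lists, can be sketched as the fraction of documents shared by their top k positions. How Q' is generated from L is not described here, so the sketch takes both rankings as given; the cutoff k and the document identifiers are assumptions.

```python
def overlap_at_k(ranking_a, ranking_b, k):
    """Fraction of the top-k documents that two ranked lists have in common."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / k

# Hypothetical rankings L (for Q) and L' (for Q').
ranking_l = ["d1", "d2", "d3", "d4"]
ranking_l_prime = ["d2", "d1", "d5", "d6"]
print(overlap_at_k(ranking_l, ranking_l_prime, 3))
```

A high overlap suggests that the retrieval system behaved like a clean channel for the query, which is taken as a sign that the query is easy.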

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve a better performance.

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

\[
SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)
\]

where

P_{ml}(k_i|q) is the maximum likelihood estimator of term k_i given query q

P_{coll}(k_i) is the term count of k_i in the collection divided by the total number of terms in the collection
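
The SCS formula translates almost directly into code. In this sketch the maximum likelihood estimate of a term given the query is taken to be its count in the query divided by the query length, and query terms that never occur in the collection are skipped; both are assumptions, as is the toy collection.

```python
import math
from collections import Counter

def simplified_clarity_score(query, collection):
    """SCS(q): sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k))."""
    q_terms = query.lower().split()
    q_counts = Counter(q_terms)
    coll_counts = Counter(w for text in collection for w in text.lower().split())
    coll_total = sum(coll_counts.values())
    score = 0.0
    for term, count in q_counts.items():
        p_ml = count / len(q_terms)               # P_ml(k_i | q)
        p_coll = coll_counts[term] / coll_total   # P_coll(k_i)
        if p_coll > 0:  # assumption: terms unseen in the collection are skipped
            score += p_ml * math.log2(p_ml / p_coll)
    return score

# Hypothetical collection of eight term occurrences.
collection = ["text retrieval systems rank text", "data retrieval systems"]
print(simplified_clarity_score("text rank", collection))
```

The rarer term ("rank") contributes more to the score than the more frequent one ("text"), in line with the prediction that queries with low-frequency terms perform better.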

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

\[
AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)
\]

where P_{coll}(k_i, k_\ell) is the probability that terms k_i and k_\ell occur in the same document.
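
A small sketch of Averaged PMI follows. To keep the joint and the individual probabilities comparable, all of them are estimated here as fractions of documents, and pairs of terms that never co-occur contribute nothing to the sum; both choices are assumptions, since the formula does not settle them.

```python
import math
from itertools import combinations

def averaged_pmi(query, collection):
    """Average pointwise mutual information over all pairs of distinct query terms."""
    docs = [set(text.lower().split()) for text in collection]
    n_docs = len(docs)

    def p_coll(*terms):
        """Fraction of documents that contain all the given terms."""
        return sum(1 for d in docs if all(t in d for t in terms)) / n_docs

    pairs = list(combinations(sorted(set(query.lower().split())), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for k_i, k_l in pairs:
        joint = p_coll(k_i, k_l)
        if joint > 0:  # assumption: pairs that never co-occur add nothing to the sum
            total += math.log2(joint / (p_coll(k_i) * p_coll(k_l)))
    return total / len(pairs)

# Hypothetical collection.
collection = [
    "information retrieval systems",
    "information retrieval models",
    "information theory",
    "query languages",
]
print(averaged_pmi("information retrieval", collection))
```

Terms that co-occur more often than their individual frequencies would suggest yield a positive value, which is read as a sign of a coherent query.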

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

However, neither technique is efficient enough to be useful for large collections.


Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query

Further the resulting documents are ranked accordingto the degree of similarity with respect to the query

To support ranking two common statistics on wordoccurrences inside texts are commonly used

The first is called term frequency and counts the number oftimes a word appears inside a document

The second is called inverse document frequency and countsthe number of documents in which a word appears

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11

Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form

In this case a document matches a query only if it contains allthe words in the query

This is useful when the number of results for one singleword is too large

Additionally it may be required that the exact positionsin which a word occurs in the text should be provided

This might be useful for highlighting word occurrencesin snippets for instance during the display of results

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12

Context QueriesMany systems complement queries with the ability tosearch words in a given context

Words which appear near each other may signal higherlikelihood of relevance than if they appear apart

We may want to form phrases of words or find wordswhich are proximal in the text

Phrase

Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words

Proximity

Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13

Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators

A boolean query has a syntax composed of

atoms basic queries that retrieve documents

boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents

This scheme is in general compositional operatorscan be composed over the results of other operators

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14

Boolean QueriesA query syntax tree is naturally defined

Consider the example of a query syntax tree below

syntax syntactic

ORtranslation

AND

It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15

Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are

e1 OR e2 the query selects all documents which satisfy e1 or e2

e1 AND e2 selects all documents which satisfy both e1 and e2

e1 BUT e2 selects all documents which satisfy e1 but not e2

NOT e2 the query selects all documents which not contain e2

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16

Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided

A document either satisfies the boolean query or it does not

This is quite a limitation because it does not allow forpartial matching between a document and a user query

To overcome this limitation the condition for retrievalmust be relaxed

For instance a document which partially satisfies an ANDcondition might be retrieved

The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17

Boolean QueriesA fuzzy-boolean set of operators has been proposed

The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents

The documents are ranked higher when they have alarger number of elements in common with the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18

Query LanguagesBeyond Keywords

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment

Those segments satisfying the pattern specifications are said to match the pattern

We can search for documents containing segments which match a given search pattern

Each system allows specifying some types of patterns

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate

Pattern Matching

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
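Several of the pattern types listed above can be sketched over a small vocabulary. The word list and function names below are illustrative assumptions, and the error threshold is interpreted here as a maximum Levenshtein (edit) distance, which is one common choice rather than the only one.

```python
import re

# Hypothetical vocabulary of text words.
WORDS = ["retrieval", "retrieve", "retrieved", "information", "inform", "formal", "query"]


def prefix_match(prefix, words):
    return [w for w in words if w.startswith(prefix)]


def suffix_match(suffix, words):
    return [w for w in words if w.endswith(suffix)]


def substring_match(sub, words):
    return [w for w in words if sub in w]


def range_match(low, high, words):
    """Words lying lexicographically between low and high (inclusive)."""
    return [w for w in words if low <= w <= high]


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_match(word, max_errors, words):
    """Words within max_errors edit operations of the given word."""
    return [w for w in words if edit_distance(word, w) <= max_errors]


def regex_match(pattern, words):
    """Words matched in full by a regular expression."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]
```

For instance, `error_match("retrieve", 1, WORDS)` accepts `retrieved` (one insertion) but rejects `retrieval`, which is two edit operations away.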

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred

In this case, a query becomes simply an enumeration of words and context queries

The negation can be handled by letting the user express that some words are not desired

Then the documents containing them are penalized in the ranking computation

Under this scheme, we have completely eliminated any reference to boolean operations and entered into the field of natural language queries
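One way to picture the penalty for undesired words is a score that adds occurrences of wanted words and subtracts occurrences of unwanted ones. This is only a sketch: the function name, the raw occurrence counts, and the `penalty` weight are assumptions made for illustration, not a scheme prescribed by the slides.

```python
def natural_language_score(doc_text, wanted, unwanted, penalty=1.0):
    """Score a document: reward wanted words, penalize unwanted ones."""
    words = doc_text.lower().split()
    score = float(sum(words.count(w) for w in wanted))
    score -= penalty * sum(words.count(w) for w in unwanted)
    return score


doc = "java coffee beans and java island"
plain = natural_language_score(doc, ["java"], [])
penalized = natural_language_score(doc, ["java"], ["island"])
```

The document is still retrieved when it contains an undesired word; it simply moves down the ranking.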

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them

The standardization of languages to represent structured texts has pushed forward in this direction

Mixing contents and structure in queries allows posing very powerful queries

Queries can be expressed using containment, proximity, or other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Structural Queries

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

[Figure: schematic views of the three structures, in panels a), b), and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive

The documents had a fixed set of fields, and each field had some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

Fields were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field
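Restricting a basic pattern to a given field can be sketched as follows. The documents, field names, and the `search_field` helper are hypothetical; fields are flat (no nesting or overlap) and may be missing from some documents, as described above.

```python
# Hypothetical documents with a fixed set of flat fields.
DOCS = [
    {"title": "Structural queries", "author": "Smith", "body": "Keyword search basics"},
    {"title": "Keyword search", "body": "Structural restrictions on fields"},
]


def search_field(docs, field, word):
    """Return the indices of documents whose given field contains the word."""
    word = word.lower()
    hits = []
    for i, doc in enumerate(docs):
        text = doc.get(field, "")  # a field may be absent from a document
        if word in text.lower().split():
            hits.append(i)
    return hits
```

The same word gives different answers depending on the field: `search_field(DOCS, "title", "structural")` and `search_field(DOCS, "body", "structural")` select different documents.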

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table

There are several proposals that extend SQL to allow full-text retrieval

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power

Retrieval from hypertext began as a merely navigational activity

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted

Some query tools allow querying hypertexts based on their content and their structure

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure

An example of a hierarchical structure: the page of a book and its schematic view

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) and its schematic view as a tree, with a chapter node containing section nodes, which in turn contain title and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented

[Figure: parsed query tree built from the nodes IN, figure, WITH, section, WITH, title, and "structural"]

This parsed query returns the image below

[Figure: the retrieved image]

Query Languages: Query Protocols

Query Protocols

Sometimes, query languages are used by applications to query text databases

Because they are not intended for human use, we refer to them as protocols rather than languages

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols

The main goal of these protocols is to provide disc interchangeability

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture

The language does not define any specific formatting or markup

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition

For example:

where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs

People have different search needs at different times and in different contexts

A number of researchers have attempted to classify and tally the types of information needs

Characterizing Web Queries

Most web search engines record information about the queries

This information includes the query itself, the time, and the IP address

Some systems also record which search results were clicked on for a given query

These logs are a valuable resource for understanding the kinds of information needs that users have

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length
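The three basic statistics can be computed from a list of logged query strings in a few lines. This is a sketch under simple assumptions: queries are lowercased and split on whitespace, and the sample log is invented for illustration.

```python
from collections import Counter


def basic_query_stats(queries):
    """Return query occurrences, term occurrences, length distribution, and mean length."""
    query_counts = Counter(q.strip().lower() for q in queries)
    term_counts = Counter()
    length_counts = Counter()
    for q in queries:
        terms = q.lower().split()
        term_counts.update(terms)
        length_counts[len(terms)] += 1
    total_terms = sum(n * c for n, c in length_counts.items())
    mean_length = total_terms / max(len(queries), 1)  # avoid dividing by zero on an empty log
    return query_counts, term_counts, length_counts, mean_length


# Hypothetical log entries (query text only).
LOG = ["weather", "weather forecast", "Weather", "cheap flights to rome"]
query_counts, term_counts, length_counts, mean_length = basic_query_stats(LOG)
```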

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these two last references

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1

Characterizing Web Queries

Queries, as words in a text, follow a biased distribution

In fact, the frequency of query words follows a Zipf's law with parameter α

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences

However, this is less biased than Web text, where α is closer to 2

The standard correlation between the frequency of a word in the Web pages and in the queries also varies

These values range from 0.15 to 0.42
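Under a Zipf's law, the frequency of the word of rank r is proportional to r^(-α), so log-frequency is linear in log-rank with slope -α. A rough way to estimate α from observed word frequencies is therefore a least-squares line fit in log-log space. The sketch below assumes this simple estimator (more careful estimators exist) and uses synthetic data.

```python
import math


def estimate_zipf_alpha(frequencies):
    """Estimate alpha as the negated least-squares slope of log(freq) vs log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


# Synthetic frequencies that follow an exact power law with alpha = 1.2.
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 201)]
alpha = estimate_zipf_alpha(synthetic)
```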

Characterizing Web Queries

This implies that what people search is different from what people publish in the Web

Relative query frequency vs. relative document frequency of each word in a vocabulary

[Figure: log-log scatter plot with Document frequency on the x-axis and Query frequency on the y-axis]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search

Many people refine the query by adding and removing words, but most of them see very few answer pages

The table below shows query statistics for four different search engines

Measure                  AltaVista   Excite   AlltheWeb   TodoCL
                         (1998)      (2001)   (2001)      (2002)
Words per query          2.4         2.6      2.3         1.1
Queries per user         2.0         2.3      2.9         -
Answer pages per query   1.3         1.7      2.2         1.2
Boolean queries          <40%        10%      -           16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query)

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics)

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine

[Figure: state diagram with the labels Home Page, Advanced search, Directory, Results, Selected, Refining, URL, Web, and End session; edges carry transition probabilities and are drawn in three classes: x > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted to help the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Query Intent

Rose & Levinson developed a taxonomy that extended and somewhat changed Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus, they replace Broder's transactions category with a broader category of resources

Query Intent

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be classified in more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997   2001
1      Commerce, travel, employment, or economy    13.3   24.7
2      People, places, or things                   6.7    19.7
3      Non-English or unknown                      4.1    11.3
4      Computers or Internet                       12.5   9.6
5      Sex or pornography                          16.8   8.5
6      Health or sciences                          9.5    7.5
7      Entertainment or recreation                 19.9   6.6
8      Education or humanities                     5.6    4.5
9      Society, culture, ethnicity, or religion    5.4    3.9
10     Government                                  3.4    2.0
11     Performing or fine arts                     5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: F-score of about 45% on 63 categories

The table below shows the results for five queries

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They classified the results of a query with the classifier and then used a voting algorithm to classify the query

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed limits of time to define sessions, but this definition presents two problems:

the session might be longer than the fixed limit

the goals of the user in the session might be more than one

Hence, it is good to distinguish:

time-based sessions: queries of a user in a same session

missions: sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are sequences of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
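Session detection with a maximum inactivity time can be sketched as a single pass over one user's query timestamps. The 30-minute default below is an arbitrary assumption inside the five-to-sixty-minute range mentioned above, and timestamps are assumed to be in seconds.

```python
def split_sessions(timestamps, max_inactivity=30 * 60):
    """Split one user's query timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(t)  # still within the inactivity limit
        else:
            sessions.append([t])  # gap too long (or first query): open a new session
    return sessions


sessions = split_sessions([0, 60, 120, 4000, 4100, 10000])
```

Detecting missions is a different problem: it requires comparing the queries themselves, since one session can mix several goals and one goal can span several sessions.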

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms: run the query and analyze its answer set

Pre-retrieval mechanisms: evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top ranked results of an unambiguous query will be topically cohesive

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution
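The intuition behind Clarity Score can be sketched as a Kullback-Leibler divergence between a language model of the top retrieved documents and a language model of the collection. This is a simplification made for illustration: both models are plain unsmoothed maximum likelihood estimates, the top documents are assumed to belong to the collection, and the published Clarity Score weights documents by query likelihood, which is omitted here.

```python
import math
from collections import Counter


def language_model(texts):
    """Unsmoothed maximum likelihood unigram model over a list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def clarity_score(top_docs, collection):
    """KL divergence (base 2) of the top-documents model from the collection model.

    Assumes every top document is part of the collection, so each word
    in the top-documents model has a non-zero collection probability.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[w]) for w, p in p_top.items())


# Hypothetical four-document collection.
collection = [
    "jaguar car speed",
    "jaguar cat jungle",
    "car engine speed",
    "cat jungle animal",
]
focused = clarity_score([collection[0], collection[2]], collection)  # topically cohesive results
diffuse = clarity_score(collection, collection)  # results look like the whole collection
```

A cohesive result set yields a higher score than one whose term distribution mirrors the collection, for which the score drops to zero.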

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′

The overlap between L and L′ is used as the prediction score

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity
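The last step of Query Feedback, comparing L with L′, can be sketched as the fraction of documents shared by the top k positions of both rankings. The cutoff `k`, the function name, and the normalization by `k` are assumptions made for illustration; how Q′ is generated from L is not covered here.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Fraction of the top-k positions shared by two ranked lists of document ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Hypothetical rankings L (from Q) and L' (from the generated query Q').
ranking_l = ["d1", "d2", "d3", "d4"]
ranking_l_prime = ["d2", "d1", "d5", "d6"]
score = overlap_at_k(ranking_l, ranking_l_prime, k=4)
```

A high overlap suggests that the channel introduced little noise, which is read as a sign of an easy query.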

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they either take into account:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency

Queries with low frequency terms are predicted to achieve a better performance

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies

\[
\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2\left(\frac{P_{ml}(k_i|q)}{P_{coll}(k_i)}\right)
\]

where

P_{ml}(k_i|q) is the maximum likelihood estimator of term k_i given query q

P_{coll}(k_i) is the term count of k_i in the collection divided by the total number of terms in the collection
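The SCS formula translates almost literally into code. In this sketch the maximum likelihood estimate P_ml(k_i|q) is taken as the count of the term in the query divided by the query length, and every query term is assumed to occur in the collection; the collection counts used in the example are invented.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_counts, collection_size):
    """Sum over distinct query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k))."""
    q_counts = Counter(query_terms)
    q_len = len(query_terms)
    score = 0.0
    for term, count in q_counts.items():
        p_ml = count / q_len  # maximum likelihood estimate within the query
        p_coll = collection_counts[term] / collection_size  # relative collection frequency
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection of 1024 term occurrences.
counts = {"text": 64, "mining": 16}
scs = simplified_clarity_score(["text", "mining"], counts, 1024)
```

Here each term has P_ml = 0.5, and the collection probabilities are 1/16 and 1/64, so the score is 0.5 × 3 + 0.5 × 5 = 4. Rarer terms raise the score, in line with the prediction that queries with low frequency terms perform better.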

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection

\[
\mathrm{AvPMI}(q) = \frac{1}{|(k_i,k_\ell)|} \sum_{(k_i,k_\ell) \in q} \log_2\left(\frac{P_{coll}(k_i,k_\ell)}{P_{coll}(k_i)\,P_{coll}(k_\ell)}\right)
\]

where P_{coll}(k_i,k_\ell) is the probability that terms k_i and k_\ell occur in the same document
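A sketch of Averaged PMI over a small collection follows. Two assumptions are made here and are not taken from the slides: the single-term probabilities are estimated at document level (fraction of documents containing the term) so that they are comparable with the co-occurrence probability, and every pair of query terms is assumed to co-occur in at least one document.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over all pairs of distinct query terms."""
    doc_sets = [set(d.lower().split()) for d in docs]
    n = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in s for t in terms) for s in doc_sets) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    if not pairs:
        raise ValueError("need at least two distinct query terms")
    total = 0.0
    for a, b in pairs:
        total += math.log2(prob(a, b) / (prob(a) * prob(b)))
    return total / len(pairs)


# Hypothetical collection in which the two query terms always co-occur.
docs = ["text mining", "text mining", "dog shampoo", "dog shampoo"]
av_pmi = averaged_pmi(["text", "mining"], docs)
```

In this example each term appears in half of the documents and the pair co-occurs in half of them, giving log2(0.5 / 0.25) = 1. Terms that co-occur no more often than chance give a value of 0.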

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

However, neither technique is efficient enough to be useful for large collections

Query LanguagesThere are a number of techniques to enhance theusefulness of the queries

Some examples are the expansion of a word to the setof its synonyms or the use of a thesaurus

Some words which are very frequent and do not carrymeaning (called stopwords) may be removed

We refer to words that can be used to match queryterms as keywords

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 6

Query LanguagesAnother issue is the subject of the retrieval unit theinformation retrieval system adopts

The retrieval unit is the basic element which can beretrieved as an answer to a query

We call the retrieval units simply documents even ifthis reference can be used with different meanings

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 7

Query LanguagesKeyword Based Querying

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 8

Keyword Based QueryingA query is the formulation of a user information need

Keyword based queries are popular since they areintuitive easy to express and allow for fast ranking

However a query can also be a more complexcombination of operations involving several words

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 9

Word QueriesThe most elementary query that can be formulated in atext retrieval system is the word

Some models are also able to see the internal divisionof words into letters

In this case the alphabet is split into letters and separators

A word is a sequence of letters surrounded by separators

The division of the text into words is not arbitrary sincewords carry a lot of meaning in natural language

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 10

Word QueriesThe result of word queries is the set of documentscontaining at least one of the words of the query

Further the resulting documents are ranked accordingto the degree of similarity with respect to the query

To support ranking two common statistics on wordoccurrences inside texts are commonly used

The first is called term frequency and counts the number oftimes a word appears inside a document

The second is called inverse document frequency and countsthe number of documents in which a word appears

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 11

Word QueriesThe other possibility of interpreting queries popularizedby Web search engines is the conjunctive form

In this case a document matches a query only if it contains allthe words in the query

This is useful when the number of results for one singleword is too large

Additionally it may be required that the exact positionsin which a word occurs in the text should be provided

This might be useful for highlighting word occurrencesin snippets for instance during the display of results

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 12

Context QueriesMany systems complement queries with the ability tosearch words in a given context

Words which appear near each other may signal higherlikelihood of relevance than if they appear apart

We may want to form phrases of words or find wordswhich are proximal in the text

Phrase

Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words

Proximity

Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13

Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators

A boolean query has a syntax composed of

atoms basic queries that retrieve documents

boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents

This scheme is in general compositional operatorscan be composed over the results of other operators

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14

Boolean QueriesA query syntax tree is naturally defined

Consider the example of a query syntax tree below

syntax syntactic

ORtranslation

AND

It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15

Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are

e1 OR e2 the query selects all documents which satisfy e1 or e2

e1 AND e2 selects all documents which satisfy both e1 and e2

e1 BUT e2 selects all documents which satisfy e1 but not e2

NOT e2 the query selects all documents which not contain e2

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16

Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided

A document either satisfies the boolean query or it does not

This is quite a limitation because it does not allow forpartial matching between a document and a user query

To overcome this limitation the condition for retrievalmust be relaxed

For instance a document which partially satisfies an ANDcondition might be retrieved

The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17

Boolean QueriesA fuzzy-boolean set of operators has been proposed

The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents

The documents are ranked higher when they have alarger number of elements in common with the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18

Query LanguagesBeyond Keywords

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19

Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment

Those segments satisfying the pattern specificationsare said to match the pattern

We can search for documents containing segmentswhich match a given search pattern

Each system allows specifying some types of patterns

The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20

Pattern MatchingThe most used types of patterns are

Words a string which must be a word in the text

Prefixes a string which must form the beginning of a text word

Suffixes a string which must form the termination of a text word

Substrings a string which can appear within a text word

Ranges a pair of strings which matches any word whichlexicographically lies between them

Allowing errors a word together with an error threshold

Regular expressions a rather general pattern built up by simplestrings

Extended patterns a more user-friendly query language torepresent some common cases of regular expressions

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21

Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred

In this case a query becomes simply an enumeration ofwords and context queries

The negation can be handled by letting the userexpress that some words are not desired

Then the documents containing them are penalized in the rankingcomputation

Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22

Query LanguagesStructural Queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23

Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24

Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.
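
A field-restricted search over such a fixed structure can be sketched as follows. The dictionary-of-fields document representation and the function name `search_field` are assumptions for illustration, not a description of any particular system.

```python
def search_field(docs, field, word):
    """Return ids of documents whose given field contains the word."""
    word = word.lower()
    hits = []
    for doc_id, fields in docs.items():
        text = fields.get(field, "")  # a field may be absent from a document
        if word in text.lower().split():
            hits.append(doc_id)
    return hits

docs = {
    1: {"title": "Text search basics", "body": "Ranking and indexing"},
    2: {"title": "Databases", "body": "Full text search in SQL"},
    3: {"body": "No title field here, only search text"},
}
title_hits = search_field(docs, "title", "search")
body_hits = search_field(docs, "body", "search")
```

Because fields neither nest nor overlap, a flat mapping from field name to text is enough to model each document.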

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page (Chapter 6, with sections 6.1 "Keyword Based" and 6.3 "Structured Queries") next to its schematic view, a tree in which a chapter node contains section nodes, which in turn contain title and figure nodes.]

Fixed Structure

An example of a query to the hierarchical structure presented:

IN
  figure
  WITH
    section
    WITH
      title
      "structural"

This parsed query returns the figure contained in the section whose title includes the word "structural" (the resulting image is not reproduced here).
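
One way to read the query above is "figures IN sections WITH a title WITH the word structural". The sketch below evaluates that reading over a small tree. The `Node` class, the sample book, and the hard-coded query function are hypothetical; a real structural query language would parse and evaluate arbitrary operator trees.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: list = field(default_factory=list)

def descendants(node):
    """Yield every node strictly below the given node."""
    for child in node.children:
        yield child
        yield from descendants(child)

def figures_in_sections_titled(root, word):
    """Evaluate: figure IN (section WITH (title WITH word))."""
    result = []
    for node in [root, *descendants(root)]:
        if node.kind != "section":
            continue
        has_title = any(
            d.kind == "title" and word in d.text.lower().split()
            for d in descendants(node)
        )
        if has_title:
            result.extend(d for d in descendants(node) if d.kind == "figure")
    return result

book = Node("chapter", children=[
    Node("section", children=[Node("title", "Keyword Based")]),
    Node("section", children=[
        Node("title", "Structural Queries"),
        Node("figure", "fig-structures"),
    ]),
])
found = [f.text for f in figures_in_sections_titled(book, "structural")]
```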

Query Languages: Query Protocols

Query Protocols

Sometimes, query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

... where paper contains "retrieval" or like "info%" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length
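
The three basic statistics can be computed from a list of logged query strings with a few counters. This is a minimal sketch; the whitespace tokenization, the toy log, and the function name `basic_stats` are assumptions, and real logs would also need cleaning and user identification.

```python
from collections import Counter

def basic_stats(queries):
    """Compute query occurrences, term occurrences, and query-length statistics."""
    query_counts = Counter(q.strip().lower() for q in queries)
    term_counts = Counter(t for q in queries for t in q.lower().split())
    lengths = [len(q.split()) for q in queries]
    mean_length = sum(lengths) / len(lengths)
    length_dist = Counter(lengths)  # number of queries having each length
    return query_counts, term_counts, mean_length, length_dist

log = ["weather", "cheap flights", "weather", "new york weather", "cheap hotels rome"]
query_counts, term_counts, mean_length, length_dist = basic_stats(log)
```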

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo! UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these last two references.

Length    Dogpile (2005)    Yahoo (2007)
1         18%               25%
2         32%               35%
3         25%               23%
4         13%               11%
5         6%                4%
6         3%                1%
>7        3%                1%

Characterizing Web Queries

Queries, as words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in the Web pages and in the queries also varies.

These values range from 0.15 to 0.42.
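
A rough way to estimate α from observed word frequencies is to fit a line to log(frequency) against log(rank); under Zipf's law f(r) ∝ r^(-α), the slope of that line is -α. The sketch below uses an ordinary least-squares fit, which is a simple choice made here for illustration (other estimators exist), and the data is synthetic.

```python
import math

def zipf_alpha(frequencies):
    """Estimate alpha as the negated least-squares slope of log(freq) vs log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Synthetic frequencies following f(r) = 1000 / r exactly, so alpha should be 1.
synthetic = [1000 / r for r in range(1, 51)]
alpha = zipf_alpha(synthetic)
```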

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary, on log-log axes (x-axis: document frequency; y-axis: query frequency).]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                   AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query           2.4                2.6             2.3                1.1
Queries per user          2.0                2.3             2.9                –
Answer pages per query    1.3                1.7             2.2                1.2
Boolean queries           <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. States: Home Page, Advanced search, Directory, Results, Selected, Refining, End session (other labels in the diagram: URL, Web). Edges carry transition probabilities, which cannot be recovered from this text.]
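
A state diagram of this kind can be estimated from logged sessions by counting how often one state follows another. The sketch below is only an illustration: the sessions are invented, and the first-order (Markov) assumption is ours, not a claim about how the original diagram was built.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next state | current state) from observed state sequences."""
    counts = defaultdict(Counter)
    for states in sessions:
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for state, following in counts.items()
    }

# Hypothetical sessions using the state names of the diagram.
sessions = [
    ["Home Page", "Results", "Selected", "End session"],
    ["Home Page", "Results", "Refining", "Results", "End session"],
    ["Home Page", "Directory", "End session"],
]
probs = transition_probabilities(sessions)
```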

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted to help the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: The immediate intent is to reach a particular site (24.5% survey, 20% query log).

Informational: The intent is to acquire some information (39% survey, 48% query log).

Transactional: The intent is to perform some web-mediated activity (36% survey, 30% query log).

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank  Topic                                        1997    2001
1     Commerce, travel, employment, or economy     13.3    24.7
2     People, places, or things                    6.7     19.7
3     Non-English or unknown                       4.1     11.3
4     Computers or Internet                        12.5    9.6
5     Sex or pornography                           16.8    8.5
6     Health or sciences                           9.5     7.5
7     Entertainment or recreation                  19.9    6.6
8     Education or humanities                      5.6     4.5
9     Society, culture, ethnicity, or religion     5.4     3.9
10    Government                                   3.4     2.0
11    Performing or fine arts                      5.4     1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: F-score of about 0.45 on 63 categories.

The table below shows the results for five queries.

Query                   Top category              Second category
chat rooms              Computers/Internet        Online Community/Chat
lake michigan lodges    Info/Local & Regional     Living/Travel & Vacation
stephen hawking         Info/Science & Tech       Info/Arts & Humanities
dog shampoo             Shopping/Buying Guides    Living/Pets & Animals
text mining             Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
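
The classify-then-vote idea can be sketched as follows. This is not Broder et al.'s implementation: the simple majority vote, the keyword-based `toy_classifier`, and the sample results are all stand-ins chosen for illustration.

```python
from collections import Counter

def classify_query(results, classify_document, top_k=1):
    """Classify each retrieved result, then let the results vote on the query class."""
    votes = Counter(classify_document(text) for text in results)
    return [category for category, _ in votes.most_common(top_k)]

def toy_classifier(text):
    """Hypothetical stand-in for a classifier trained on a labeled taxonomy."""
    keywords = {"shampoo": "Shopping", "price": "Shopping", "breed": "Pets"}
    for word, category in keywords.items():
        if word in text.lower():
            return category
    return "Other"

# Hypothetical search results for the query "dog shampoo".
results = [
    "Best dog shampoo price comparison",
    "Dog breed guide",
    "Buy shampoo for dogs online",
]
predicted = classify_query(results, toy_classifier)
```

The appeal of this design is that a short query offers very little text to classify, while its search results provide much richer evidence.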

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed limits of time to define sessions, but this definition presents two problems:

the session might be longer

the goals of the user in the session might be more than one

Hence, it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
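
Splitting one user's queries into sessions with a maximum inactivity time can be sketched as below. The 30-minute default is an arbitrary value inside the five-to-sixty-minute range mentioned above, and the function name and timestamps are illustrative assumptions.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group query timestamps into sessions using an inactivity timeout."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(t)   # still within the inactivity window
        else:
            sessions.append([t])     # gap too long: start a new session
    return sessions

# Query times of one user, in seconds since an arbitrary origin.
times = [0, 120, 300, 4000, 4100, 9000]
sessions = split_sessions(times)
```

Note that this only detects time-based sessions; grouping queries into missions would additionally require comparing the queries themselves.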

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
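
The intuition behind Clarity Score can be sketched as a Kullback-Leibler divergence between a unigram model of the top retrieved documents and a unigram model of the whole collection. This is a simplified sketch: it uses unsmoothed maximum-likelihood models and weights all top documents equally, both of which are assumptions made here and not the published formulation; the toy collection is invented.

```python
import math
from collections import Counter

def language_model(texts):
    """Maximum-likelihood unigram model over a list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clarity(top_docs, collection):
    """KL divergence (in bits) of the top-documents model from the collection model.

    Assumes top_docs is a subset of collection, so every word has a collection
    probability greater than zero.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[w]) for w, p in p_top.items())

collection = [
    "jaguar car engine", "jaguar cat jungle", "car engine repair",
    "jungle cat habitat", "stock market news", "market news today",
]
focused = clarity(collection[:1] + collection[2:3], collection)  # both about cars
diffuse = clarity(collection, collection)                        # same as collection
```

A topically cohesive result set yields a high score, while a result set that looks like the collection as a whole yields a score near zero.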

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy.

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
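
The Query Feedback steps (Q, L, Q′, L′, overlap) can be traced with a toy example. Everything concrete here is an assumption for illustration: the term-count `retrieve` function stands in for a real retrieval system, Q′ is built from the most frequent words of L, and the overlap is normalized by the list size `k`.

```python
from collections import Counter

def retrieve(query_terms, docs, k=3):
    """Toy retrieval: rank documents by number of query-term occurrences."""
    scores = {
        doc_id: sum(text.lower().split().count(t) for t in query_terms)
        for doc_id, text in docs.items()
    }
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [d for d in ranked if scores[d] > 0][:k]

def query_feedback(query_terms, docs, k=3, n_terms=2):
    """Overlap between the ranked list of Q and that of a query Q' generated from it."""
    first = retrieve(query_terms, docs, k)                    # L
    counts = Counter(w for d in first for w in docs[d].lower().split())
    new_query = [w for w, _ in counts.most_common(n_terms)]   # Q'
    second = retrieve(new_query, docs, k)                     # L'
    return len(set(first) & set(second)) / k

docs = {
    "d1": "solar panel cost solar energy",
    "d2": "solar energy storage",
    "d3": "solar panel installation",
    "d4": "football results",
}
score = query_feedback(["solar", "panel"], docs)
```

A high overlap suggests that the ranked list preserves the information in the query, which is read as a sign of an easier query.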

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low frequency terms are predicted to achieve a better performance.
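
A frequency-based predictor such as Averaged IDF can be sketched in a few lines. The IDF form log2(N/df) and the decision to skip terms absent from the collection are assumptions made for this sketch; several IDF variants exist.

```python
import math

def averaged_idf(query_terms, docs):
    """Mean of log2(N / df) over the query terms that occur in the collection."""
    n_docs = len(docs)
    tokenized = [set(text.lower().split()) for text in docs]
    idfs = []
    for term in query_terms:
        df = sum(term in words for words in tokenized)  # document frequency
        if df > 0:
            idfs.append(math.log2(n_docs / df))
    return sum(idfs) / len(idfs) if idfs else 0.0

docs = ["the cat sat", "the dog ran", "the cat ran", "quantum chromodynamics the"]
common = averaged_idf(["the", "cat"], docs)
rare = averaged_idf(["quantum", "chromodynamics"], docs)
```

The query made of rare terms gets the higher value, in line with the prediction that low-frequency terms lead to better performance.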

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2\left(\frac{P_{ml}(k_i|q)}{P_{coll}(k_i)}\right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
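
The SCS definition translates directly into code. In this sketch, P_ml(k_i|q) is taken as the relative frequency of the term in the query, and terms that never occur in the collection are skipped to avoid a division by zero; that last choice is an assumption of the sketch, not part of the definition.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, collection_tokens):
    """SCS(q) = sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k))."""
    coll_counts = Counter(collection_tokens)
    total = len(collection_tokens)
    q_counts = Counter(query_terms)
    score = 0.0
    for term, count in q_counts.items():
        p_ml = count / len(query_terms)        # relative frequency in the query
        p_coll = coll_counts[term] / total     # relative frequency in the collection
        if p_coll > 0:  # terms unseen in the collection are skipped in this sketch
            score += p_ml * math.log2(p_ml / p_coll)
    return score

tokens = "a a a a b b c d".split()
scs = simplified_clarity_score(["c", "d"], tokens)
```

Here each query term has P_ml = 1/2 and P_coll = 1/8, so each contributes 0.5 × log2(4) = 1 to the score.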

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2\left(\frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\,P_{coll}(k_\ell)}\right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
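
Averaged PMI can be sketched over a small collection as follows. Two choices here are assumptions of the sketch: all probabilities are estimated at the document level (fraction of documents containing the terms), and pairs that never co-occur are skipped rather than contributing minus infinity.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average log2(P(ki, kl) / (P(ki) P(kl))) over pairs of distinct query terms."""
    n_docs = len(docs)
    tokenized = [set(text.lower().split()) for text in docs]

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in words for t in terms) for words in tokenized) / n_docs

    values = []
    for ki, kl in combinations(sorted(set(query_terms)), 2):
        joint = prob(ki, kl)
        if joint > 0:  # pairs that never co-occur are skipped in this sketch
            values.append(math.log2(joint / (prob(ki) * prob(kl))))
    return sum(values) / len(values) if values else 0.0

docs = ["new york", "new york", "old house", "old car"]
pmi = averaged_pmi(["new", "york"], docs)
```

In this toy collection the two terms always appear together, so the joint probability (1/2) is twice the product of the marginals (1/4), giving a PMI of 1 bit.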

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.

Extended patterns a more user-friendly query language torepresent some common cases of regular expressions

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21

Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred

In this case a query becomes simply an enumeration ofwords and context queries

The negation can be handled by letting the userexpress that some words are not desired

Then the documents containing them are penalized in the rankingcomputation

Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22

Query LanguagesStructural Queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23

Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24

Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed StructureThe structure allowed in texts was traditionally quiterestrictive

The documents had a fixed set of fields and each fieldhad some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

They were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26

Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power

Retrieval from hypertext began as a merely navigationalactivity

That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted

Some query tools allow querying hypertext based ontheir content and their structure

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28

Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure

An example of a hierarchical structure the page of abook and its schematic view

Chapter 6

We cover in this chapter the different kind of

61 Keyword Based

63 Structured Queries

chapter

section section

title title figure

We cover Keyword Based Structural

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29

Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

  • query occurrences
  • term occurrences
  • query length
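The three basic statistics can be computed from a log in a few lines. The sketch below is only illustrative: the function name `basic_query_stats`, the whitespace tokenization, and the toy log are assumptions, not part of any cited study.

```python
from collections import Counter


def basic_query_stats(queries):
    """Return (query counts, term counts, mean query length in terms)."""
    # Normalize case and whitespace so identical queries are counted together.
    normalized = [" ".join(q.lower().split()) for q in queries]
    query_counts = Counter(normalized)
    term_counts = Counter(t for q in normalized for t in q.split())
    lengths = [len(q.split()) for q in normalized]
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return query_counts, term_counts, mean_length


# Hypothetical toy log; a real log would also carry the time and IP address.
log = ["cheap flights", "Cheap  flights", "weather", "weather in paris"]
query_counts, term_counts, mean_length = basic_query_stats(log)
```

Real studies typically apply further cleaning (robot filtering, spelling normalization) before counting; none of that is attempted here.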

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5         6                4
6         3                1
>7        3                1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
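One rough way to estimate α from observed word frequencies is a least-squares fit of log frequency against log rank. This is a sketch under that assumption; it is a crude estimator, not the method used in the cited studies, and `estimate_zipf_alpha` is a name chosen for illustration.

```python
import math


def estimate_zipf_alpha(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # Zipf's law: frequency is proportional to rank ** (-alpha).
    return -sxy / sxx


# Synthetic frequencies generated with alpha = 1.0 (illustrative data only).
synthetic = [1000.0 / rank for rank in range(1, 201)]
alpha = estimate_zipf_alpha(synthetic)
```

On real logs the low-frequency tail is noisy, so the fitted value depends on how many ranks are included.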

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary; log-log plot with x-axis "Document frequency" (1e-06 to 1) and y-axis "Query frequency" (1e-07 to 1)]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                -
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             -                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of queries has shifted from leisure to e-commerce (more details later, in the section about Query Topic).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels: Home Page, Advanced search, Directory, Web, URL, Results, Selected, Refining, End session. Edges carry transition probabilities; the legend distinguishes x > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)
  • Informational: the intent is to acquire some information (39% survey, 48% query log)
  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • Anchor-text distribution of the words in the queries
  • Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or
  • the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997   2001
1      Commerce, travel, employment, or economy    13.3   24.7
2      People, places, or things                    6.7   19.7
3      Non-English or unknown                       4.1   11.3
4      Computers or Internet                       12.5    9.6
5      Sex or pornography                          16.8    8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                 19.9    6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category               Second category
chat rooms             Computers/Internet         Online Community/Chat
lake michigan lodges   Info/Local & Regional      Living/Travel & Vacation
stephen hawking        Info/Science & Tech        Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides     Living/Pets & Animals
text mining            Computers/Software         Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the search results of a query through the classifier and then used a voting algorithm to classify the query itself.

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit
  • the user might have more than one goal in the session

Hence it is good to distinguish:

  • time-based sessions: queries of a user in the same session
  • missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
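Splitting on a maximum inactivity time can be sketched as follows. The 30-minute default is an assumption picked from inside the five-to-sixty-minute range quoted above, and `split_sessions` is an illustrative name; timestamps are taken to be seconds for a single user.

```python
def split_sessions(timestamps, max_inactivity=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Hypothetical timestamps: two quick pairs of queries, then a late one.
sessions = split_sessions([0, 60, 4000, 4100, 10000])
```

This only finds time-based sessions; detecting missions would additionally require comparing the queries themselves.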

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set
  • Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
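The idea can be illustrated with a simplified score: the Kullback-Leibler divergence between a unigram model of the top retrieved documents and a unigram model of the collection. This is a sketch under strong assumptions (maximum likelihood estimates, no smoothing, top documents drawn from the collection); the published Clarity Score estimates the query model more carefully, and the function names are illustrative.

```python
import math
from collections import Counter


def language_model(docs):
    """Maximum likelihood unigram model over a list of token lists."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def clarity(top_docs, collection):
    """KL divergence (in bits) of the top-document model from the collection model.

    Assumes every top document is also part of the collection, so each
    term has non-zero collection probability.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[t]) for t, p in p_top.items())


# Hypothetical four-document collection.
collection = [["jaguar", "car"], ["jaguar", "cat"],
              ["stock", "market"], ["rain", "forecast"]]
focused = clarity([collection[0]], collection)
unfocused = clarity(collection, collection)
```

A result set that looks just like the collection scores zero, while a topically cohesive one scores higher.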

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
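The overlap intuition can be sketched with a simple set measure. The Jaccard-style ratio below is an assumption chosen for illustration; it is not the measure used by Aslam et al., and `top_k_overlap` is a hypothetical name.

```python
def top_k_overlap(ranked_lists, k=10):
    """Share of documents common to the top-k of every ranked list.

    Computed as |intersection| / |union| of the top-k sets; values near 1
    suggest an easy query, values near 0 a difficult one.
    """
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    common = set.intersection(*tops)
    union = set.union(*tops)
    return len(common) / len(union) if union else 0.0


# Hypothetical rankings of the same query by two ranking functions.
agreeing = top_k_overlap([["d1", "d2", "d3"], ["d2", "d1", "d3"]], k=3)
disjoint = top_k_overlap([["d1", "d2"], ["d3", "d4"]], k=2)
```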

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and
  • a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or
  • the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)$$

where

  • $P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$
  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
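The SCS formula translates directly into code. In this sketch the maximum likelihood estimate is taken as the term's count in the query divided by the query length, and every query term is assumed to occur in the collection; the function and parameter names are illustrative.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_counts, collection_size):
    """SCS(q) = sum over query terms of P_ml * log2(P_ml / P_coll).

    collection_counts maps each term to its count in the collection;
    collection_size is the total number of terms in the collection.
    Assumes every query term has a non-zero collection count.
    """
    query_counts = Counter(query_terms)
    query_length = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / query_length
        p_coll = collection_counts[term] / collection_size
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection statistics: 1024 terms in total.
counts = {"text": 8, "mining": 2}
scs = simplified_clarity_score(["text", "mining"], counts, 1024)
```

Here each term has P_ml = 0.5, so the score is 0.5 × log2(64) + 0.5 × log2(256) = 7 bits; rarer query terms push the score up.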

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
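A minimal sketch of Averaged PMI follows. As an assumption made here for internal consistency, all three probabilities are estimated at the document level (fraction of documents containing the term or the pair), and every pair of query terms is assumed to co-occur in at least one document; `averaged_pmi` is an illustrative name.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average PMI over all pairs of distinct query terms.

    docs is a list of token lists. Probabilities are fractions of
    documents; a pair that never co-occurs makes log2 undefined and
    raises ValueError.
    """
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*terms):
        return sum(all(t in d for t in terms) for d in doc_sets) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    if not pairs:
        raise ValueError("need at least two distinct query terms")
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)


# Hypothetical collection in which "a" and "b" always appear together.
docs = [["a", "b"], ["a", "b"], ["c"], ["d"]]
avpmi = averaged_pmi(["a", "b"], docs)
```

Terms that co-occur more often than independence would predict give a positive value; independent terms give zero.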

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering
  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections.


Query Languages: Keyword Based Querying

Keyword Based Querying

A query is the formulation of a user information need.

Keyword-based queries are popular since they are intuitive, easy to express, and allow for fast ranking.

However, a query can also be a more complex combination of operations involving several words.

Word Queries

The most elementary query that can be formulated in a text retrieval system is the word.

Some models are also able to see the internal division of words into letters.

In this case, the alphabet is split into letters and separators.

A word is a sequence of letters surrounded by separators.

The division of the text into words is not arbitrary, since words carry a lot of meaning in natural language.

Word Queries

The result of word queries is the set of documents containing at least one of the words of the query.

Further, the resulting documents are ranked according to their degree of similarity with respect to the query.

To support ranking, two statistics on word occurrences inside texts are commonly used.

The first is called term frequency and counts the number of times a word appears inside a document.

The second is called inverse document frequency and is derived from the number of documents in which a word appears: the fewer documents contain the word, the higher its value.
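Both statistics are easy to compute on tokenized documents. The sketch below uses log2(N / df) for the inverse document frequency, which is one common variant and an assumption here; the function names and the toy collection are illustrative.

```python
import math
from collections import Counter


def term_frequencies(doc_tokens):
    """Number of times each word appears inside one document."""
    return Counter(doc_tokens)


def inverse_document_frequency(term, docs):
    """log2(N / df), where df is the number of documents containing term.

    Returns 0.0 for a term that appears in no document.
    """
    df = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / df) if df else 0.0


# Hypothetical four-document collection.
docs = [["text", "retrieval", "text"], ["text", "mining"],
        ["web", "search"], ["query", "log"]]
tf = term_frequencies(docs[0])
idf_text = inverse_document_frequency("text", docs)
idf_web = inverse_document_frequency("web", docs)
```

The word "text" appears in half of the documents and gets a lower weight than "web", which appears in only one.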

Word Queries

The other way of interpreting queries, popularized by Web search engines, is the conjunctive form.

In this case, a document matches a query only if it contains all the words in the query.

This is useful when the number of results for a single word is too large.

Additionally, the system may be required to provide the exact positions at which a word occurs in the text.

This might be useful, for instance, for highlighting word occurrences in snippets during the display of results.

Context Queries

Many systems complement queries with the ability to search for words in a given context.

Words that appear near each other may signal a higher likelihood of relevance than words that appear far apart.

We may want to form phrases of words or find words that are proximal in the text.

Phrase

  • Is a sequence of single-word queries
  • An occurrence of the phrase is a sequence of words
  • Can be ranked in a fashion somewhat analogous to single words

Proximity

  • Is a more relaxed version of the phrase query
  • A maximum allowed distance between single words or phrases is given
  • The ranking technique can depend on physical proximity
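Phrase and proximity matching can both be sketched over a tokenized document. Distance is measured here as the difference between word positions, which is an assumption; real systems usually answer such queries from a positional index rather than by scanning, and the function names are illustrative.

```python
def contains_phrase(tokens, phrase):
    """True if the words of phrase occur consecutively in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def within_distance(tokens, word1, word2, max_distance):
    """True if some occurrences of the two words are at most
    max_distance word positions apart."""
    positions = {}
    for pos, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(pos)
    return any(abs(p1 - p2) <= max_distance
               for p1 in positions.get(word1, [])
               for p2 in positions.get(word2, []))


tokens = "enhance the retrieval of text".split()
has_phrase = contains_phrase(tokens, ["retrieval", "of", "text"])
is_near = within_distance(tokens, "enhance", "retrieval", 2)
```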

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators.

A boolean query has a syntax composed of:

  • atoms: basic queries that retrieve documents
  • boolean operators: work on their operands (which are sets of documents) and deliver sets of documents

This scheme is in general compositional: operators can be composed over the results of other operators.

Boolean Queries

A query syntax tree is naturally defined.

Consider the example of a query syntax tree below.

[Figure: syntax tree for the query translation AND (syntax OR syntactic); the root is AND, with children translation and an OR node over syntax and syntactic]

It will retrieve all the documents that contain the word translation as well as either the word syntax or the word syntactic.

Boolean Queries

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

  • e1 OR e2: selects all documents which satisfy e1 or e2
  • e1 AND e2: selects all documents which satisfy both e1 and e2
  • e1 BUT e2: selects all documents which satisfy e1 but not e2
  • NOT e2: selects all documents which do not contain e2
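Because every operator takes sets of documents and delivers a set of documents, a syntax tree can be evaluated recursively. The sketch below represents a tree as nested tuples and atoms as single words looked up in a toy inverted index; this representation and the name `evaluate` are assumptions made for illustration.

```python
def evaluate(node, index, all_docs):
    """Evaluate a boolean query tree to a set of document ids.

    A node is either a word (atom) or a tuple such as
    ("AND", left, right), ("OR", left, right), ("BUT", left, right),
    or ("NOT", operand).
    """
    if isinstance(node, str):
        return index.get(node, set())
    op, *args = node
    if op == "NOT":
        # The complement is taken against the whole collection.
        return all_docs - evaluate(args[0], index, all_docs)
    left = evaluate(args[0], index, all_docs)
    right = evaluate(args[1], index, all_docs)
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    if op == "BUT":
        return left - right
    raise ValueError(f"unknown operator: {op}")


# Hypothetical inverted index over five documents.
index = {"translation": {1, 2, 3}, "syntax": {1, 4}, "syntactic": {2}}
all_docs = {1, 2, 3, 4, 5}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query, index, all_docs)
```

The NOT case shows why the operator is costly on its own: it needs the full set of documents in the collection.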

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided.

A document either satisfies the boolean query or it does not.

This is quite a limitation because it does not allow for partial matching between a document and a user query.

To overcome this limitation, the condition for retrieval must be relaxed.

For instance, a document which partially satisfies an AND condition might be retrieved.

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection.

Boolean Queries

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents.

Documents are ranked higher when they have a larger number of elements in common with the query.

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments which match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

  • Words: a string which must be a word in the text
  • Prefixes: a string which must form the beginning of a text word
  • Suffixes: a string which must form the termination of a text word
  • Substrings: a string which can appear within a text word
  • Ranges: a pair of strings which matches any word which lexicographically lies between them
  • Allowing errors: a word together with an error threshold
  • Regular expressions: a rather general pattern built up from simple strings
  • Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
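Most of these pattern types can be tried on a small vocabulary. In the sketch below, "allowing errors" is interpreted as Levenshtein edit distance at most k, and ranges are inclusive; both are assumptions, as are the names `edit_distance` and `match_words`.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def match_words(words, kind, pattern, k=0):
    """Return the words matching a pattern of the given kind."""
    tests = {
        "word": lambda w: w == pattern,
        "prefix": lambda w: w.startswith(pattern),
        "suffix": lambda w: w.endswith(pattern),
        "substring": lambda w: pattern in w,
        "range": lambda w: pattern[0] <= w <= pattern[1],
        "errors": lambda w: edit_distance(w, pattern) <= k,
        "regex": lambda w: re.fullmatch(pattern, w) is not None,
    }
    return [w for w in words if tests[kind](w)]


# Hypothetical vocabulary.
words = ["computer", "compute", "retrieval", "trivial", "held", "hold", "hello"]
prefix_hits = match_words(words, "prefix", "comput")
range_hits = match_words(words, "range", ("held", "hold"))
error_hits = match_words(words, "errors", "hold", k=1)
```

Scanning the vocabulary like this is only practical for small word lists; it is meant to show what each pattern type accepts, not how a system would index for it.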

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

Then the documents containing them are penalized in the ranking computation.

Under this scheme we have completely eliminated any reference to boolean operations and entered the field of natural language queries.

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details in Chapter 13 on Structured Text Retrieval.

Structural Queries

The three main types of structures:

  a) form-like fixed structure
  b) hypertext structure
  c) hierarchical structure

[Figure: three panels, labeled a), b), and c), illustrating the three structure types]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, with each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page headed "Chapter 6", with the text "We cover in this chapter the different kind of ..." and the section headings "6.1 Keyword Based ..." and "6.3 Structured Queries", shown next to its schematic view as a tree with nodes labeled chapter, section, title, and figure]

Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Query Properties

User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels: Home Page, Advanced search, Directory, URL, Web, Results, Selected, Refining, End session. Edges carry transition probabilities, grouped in the legend as > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log).

Informational: the intent is to acquire some information (39% survey, 48% query log).

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log).

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology.

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% over the same period.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank  Topic                                        1997   2001
1     Commerce, travel, employment, or economy     13.3   24.7
2     People, places, or things                     6.7   19.7
3     Non-English or unknown                        4.1   11.3
4     Computers or Internet                        12.5    9.6
5     Sex or pornography                           16.8    8.5
6     Health or sciences                            9.5    7.5
7     Entertainment or recreation                  19.9    6.6
8     Education or humanities                       5.6    4.5
9     Society, culture, ethnicity, or religion      5.4    3.9
10    Government                                    3.4    2.0
11    Performing or fine arts                       5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                 Top category             Second category
chat rooms            Computers/Internet       Online Community/Chat
lake michigan lodges  Info/Local & Regional    Living/Travel & Vacation
stephen hawking       Info/Science & Tech      Info/Arts & Humanities
dog shampoo           Shopping/Buying Guides   Living/Pets & Animals
text mining           Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the results of a query through the classifier and then used a voting algorithm to classify the query.

Query Topic

Ambiguous queries are those that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the limit

the user might have more than one goal in the session

Hence, it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
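
The maximum-inactivity rule can be sketched in a few lines. This is only an illustration: the function name `split_sessions`, the `(timestamp, query)` event format, and the 30-minute default (one value inside the five-to-sixty-minute range mentioned above) are assumptions, not part of any cited method.

```python
from datetime import datetime, timedelta


def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Group one user's (timestamp, query) events into sessions.

    Assumptions: events are sorted by timestamp, and a new session starts
    whenever the gap since the previous query exceeds max_gap.
    """
    sessions = []
    for timestamp, query in events:
        if sessions and timestamp - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions


# Hypothetical log: the third query arrives 45 minutes after the second.
log = [
    (datetime(2010, 1, 1, 9, 0), "weather"),
    (datetime(2010, 1, 1, 9, 5), "weather forecast today"),
    (datetime(2010, 1, 1, 9, 50), "dog shampoo"),
]
sessions = split_sessions(log)
```

Note that this rule only finds time-based sessions; grouping queries into missions would additionally require comparing the queries themselves.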

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
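
The idea of comparing two language models can be made concrete with a deliberately simplified sketch: the Kullback-Leibler divergence between an unsmoothed unigram model of the top retrieved documents and one of the whole collection. The names `language_model` and `clarity` are hypothetical, and the sketch assumes the top documents are part of the collection; published Clarity Score formulations use smoothed, query-dependent models, which this example does not reproduce.

```python
import math
from collections import Counter


def language_model(docs):
    """Maximum likelihood unigram model over whitespace-separated tokens."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def clarity(top_docs, collection):
    """KL divergence (in bits) of the top-document model from the collection model.

    Assumption: every document in top_docs also belongs to collection, so each
    term of the top-document model has non-zero collection probability.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[tok]) for tok, p in p_top.items())


collection = ["a a", "b b"]
focused = clarity(["a a"], collection)   # top results are about a single topic
diffuse = clarity(collection, collection)  # top results look like the collection
```

A result set that looks like the collection as a whole gets a score of zero, while a topically cohesive one gets a positive score.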

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy.
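
One simple way to quantify "overlap between the top ranked documents" is the average Jaccard overlap of the top-k sets over all pairs of ranked lists. The sketch below is an assumption-laden illustration: the function name `mean_topk_overlap`, the choice of Jaccard overlap, and the cutoff `k` are not taken from the cited work, which may measure diversity differently.

```python
from itertools import combinations


def mean_topk_overlap(ranked_lists, k=10):
    """Average pairwise Jaccard overlap of the top-k documents of each list.

    Values near 1 suggest the rankers agree (an "easy" query in this view);
    values near 0 suggest diverse rankings (a "difficult" query).
    """
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two ranked lists")
    overlaps = [len(a & b) / len(a | b) if (a | b) else 0.0 for a, b in pairs]
    return sum(overlaps) / len(overlaps)


# Hypothetical document ids returned by three ranking functions.
agreeing = mean_topk_overlap([["d1", "d2"], ["d2", "d1"], ["d1", "d2"]], k=2)
diverging = mean_topk_overlap([["d1", "d2"], ["d3", "d4"]], k=2)
```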

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
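
As a small illustration of a frequency-based predictor, the sketch below averages the inverse document frequency of the query terms, using idf = log2(N / df). The function name `averaged_idf`, the base-2 logarithm, and the choice to ignore terms that never occur in the collection are assumptions for this example; other idf variants exist.

```python
import math


def averaged_idf(query_terms, docs):
    """Mean of log2(N / df) over the query terms found in the collection.

    Assumptions: documents are tokenized on whitespace, and query terms with
    document frequency zero are ignored rather than smoothed.
    """
    n_docs = len(docs)
    doc_sets = [set(doc.lower().split()) for doc in docs]
    idfs = []
    for term in query_terms:
        df = sum(1 for terms in doc_sets if term in terms)
        if df:
            idfs.append(math.log2(n_docs / df))
    return sum(idfs) / len(idfs) if idfs else 0.0


docs = ["a b", "a c", "a d", "b"]
common = averaged_idf(["a"], docs)     # "a" occurs in 3 of 4 documents
rare = averaged_idf(["b", "c"], docs)  # "b" in 2 documents, "c" in 1
```

Under the prediction stated above, the query made of rarer terms gets the higher score.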

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
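
The SCS formula translates almost directly into code. In this sketch the function name `simplified_clarity_score` and the list-of-tokens inputs are assumptions, as is the decision to skip query terms that do not occur in the collection (the formula is undefined for them).

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_tokens):
    """Sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k)).

    P_ml(k|q) is the count of k in the query divided by the query length;
    P_coll(k) is the count of k in the collection divided by its total tokens.
    Assumption: query terms absent from the collection are skipped.
    """
    query_counts = Counter(query_terms)
    coll_counts = Counter(collection_tokens)
    total_tokens = len(collection_tokens)
    score = 0.0
    for term, count in query_counts.items():
        if coll_counts[term] == 0:
            continue
        p_ml = count / len(query_terms)
        p_coll = coll_counts[term] / total_tokens
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# One-term query whose term makes up a quarter of a toy collection.
scs = simplified_clarity_score(["a"], ["a", "b", "b", "b"])
```

Here $P_{ml} = 1$ and $P_{coll} = 0.25$, so the score is $\log_2 4 = 2$.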

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
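
A minimal sketch of Averaged PMI follows. Several choices here are assumptions made for illustration: the function name `averaged_pmi`, estimating all probabilities (single-term and joint) as fractions of documents so that they are comparable, and skipping pairs that never co-occur, for which the logarithm is undefined.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average log2(P(a, b) / (P(a) * P(b))) over pairs of distinct query terms.

    Assumptions: probabilities are document-level fractions, and pairs with
    zero co-occurrence are left out of the average.
    """
    doc_sets = [set(doc.lower().split()) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents containing all the given terms.
        return sum(1 for s in doc_sets if all(t in s for t in terms)) / n_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:
            values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0


docs = ["a b", "a b", "c", "d"]
pmi = averaged_pmi(["a", "b"], docs)
```

In the toy collection, the two terms always appear together, so the joint probability (0.5) is twice the product of the single-term probabilities (0.25) and the score is 1 bit.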

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

Consequently, neither technique is efficient enough to be useful for large collections.

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the search results of a query through the classifier and then used a voting algorithm over those results to classify the query.
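
The voting step can be illustrated with a minimal sketch. It is not Broder et al.'s actual method: it assumes each retrieved result has already been assigned exactly one category by some document classifier, and it uses plain majority voting. The function name and the category labels are hypothetical.

```python
from collections import Counter

def classify_query_by_voting(result_categories, top_n=1):
    """Return the top_n categories with the most votes.

    result_categories: one predicted category per retrieved result
    (assumed to come from a hypothetical document classifier).
    """
    votes = Counter(result_categories)
    # most_common sorts by vote count, highest first
    return [category for category, _ in votes.most_common(top_n)]

# Three results for a query such as "dog shampoo" (made-up labels)
print(classify_query_by_voting(["Pets", "Shopping", "Pets"]))
```

Weighted votes (for example by rank or by classifier confidence) would be a natural refinement of this sketch.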

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

Few papers address ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the limit

the user might have more than one goal in the session

Hence it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
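
A minimal sketch of session splitting by maximum inactivity time follows. It assumes the query timestamps of a single user are given in seconds; the 30-minute default is only one illustrative value inside the five-to-sixty-minute range mentioned above, and the function name is hypothetical.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous query
    exceeds max_gap_seconds (an assumed threshold).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(t)   # still within the inactivity limit
        else:
            sessions.append([t])     # gap too long: open a new session
    return sessions

print(split_sessions([0, 60, 4000, 4100]))
```

This only finds time-based sessions; detecting missions would additionally require comparing the queries themselves.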

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language model of the collection and that of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of the results of an ambiguous query is assumed to be more similar to the collection distribution.
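
The idea behind Clarity Score can be sketched as a Kullback-Leibler divergence between two unigram language models. This is a simplification under stated assumptions: both models are plain maximum likelihood estimates over pooled tokens, and every top-ranked document belongs to the collection. The published method estimates the query model more carefully; the function names here are hypothetical.

```python
import math
from collections import Counter

def unigram_model(docs):
    """Maximum likelihood unigram model over a list of token lists."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def clarity_score(top_docs, collection_docs):
    """KL divergence (in bits) of the top-documents model from the
    collection model. Assumes top_docs is drawn from collection_docs,
    so every term has a non-zero collection probability."""
    p_top = unigram_model(top_docs)
    p_coll = unigram_model(collection_docs)
    return sum(p * math.log2(p / p_coll[t]) for t, p in p_top.items())

collection = [["a", "b"], ["c", "d"]]
print(clarity_score([["a", "b"]], collection))  # focused result set
print(clarity_score(collection, collection))    # results look like the collection
```

A higher score means the results are more topically focused than the collection as a whole; a score near zero is what the ambiguous case above would produce.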

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
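
The overlap intuition can be sketched as the mean pairwise overlap of the top-k documents across several ranked lists. This is only an illustration, not the exact measure of Aslam et al.; the cutoff k, the function names, and the document identifiers are assumptions.

```python
from itertools import combinations

def top_k_overlap(list_a, list_b, k=10):
    """Fraction of the top-k positions shared by two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

def mean_pairwise_overlap(ranked_lists, k=10):
    """Average top-k overlap over all pairs of ranked lists
    (needs at least two lists)."""
    pairs = list(combinations(ranked_lists, 2))
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)

# Rankings of the same query by three hypothetical ranking functions
rankings = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d1", "d2", "d3"]]
print(mean_pairwise_overlap(rankings, k=3))
```

A value close to 1 would suggest an easy query under this intuition, a value close to 0 a difficult one.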

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames query performance prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
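
The Query Feedback loop can be sketched as below. The retrieval system and the procedure that derives Q′ from L are passed in as callables, because the slides do not specify them; `search`, `build_query`, and the cutoff k are hypothetical names and parameters.

```python
def query_feedback_score(query, search, build_query, k=10):
    """Overlap between the rankings of Q and of the derived query Q'.

    search:      callable mapping a query to a ranked list of doc ids
    build_query: callable deriving a new query Q' from a ranked list
    Both are assumed interfaces, supplied by the caller.
    """
    first = search(query)[:k]        # L: noisy output of the channel
    new_query = build_query(first)   # Q': generated from L
    second = search(new_query)[:k]   # L': ranking retrieved with Q'
    return len(set(first) & set(second)) / k

# Toy retrieval system backed by a fixed table
table = {"a": ["d1", "d2"], "b": ["d2", "d3"]}
score = query_feedback_score("a", table.__getitem__, lambda ranked: "b", k=2)
print(score)
```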

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
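
As an illustration of a frequency-based predictor, Averaged IDF can be sketched as follows. The logarithmic IDF form log2(N / df) is an assumption (several variants exist), as are the function and parameter names; every query term is assumed to occur in at least one document.

```python
import math

def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean IDF of the query terms, with idf = log2(N / df).

    doc_freq maps each term to the number of documents containing it;
    every query term is assumed to have a non-zero document frequency.
    """
    idfs = [math.log2(num_docs / doc_freq[t]) for t in query_terms]
    return sum(idfs) / len(idfs)

doc_freq = {"rare": 1, "common": 4}
print(averaged_idf(["rare", "common"], doc_freq, num_docs=4))
```

Higher values correspond to rarer query terms, which is the case predicted above to perform better.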

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimate of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
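
The SCS formula can be transcribed almost directly. The sketch assumes the maximum likelihood estimate P_ml(k_i|q) is the count of k_i in the query divided by the query length, and that every query term occurs in the collection; the function and parameter names are hypothetical.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, coll_term_count, coll_total_terms):
    """SCS(q) = sum over k_i in q of P_ml(k_i|q) * log2(P_ml(k_i|q) / P_coll(k_i)).

    coll_term_count maps each term to its count in the collection;
    each query term is assumed to have a non-zero count.
    """
    query_length = len(query_terms)
    score = 0.0
    for term, count in Counter(query_terms).items():
        p_ml = count / query_length                        # assumed MLE
        p_coll = coll_term_count[term] / coll_total_terms  # collection model
        score += p_ml * math.log2(p_ml / p_coll)
    return score

print(simplified_clarity_score(["a", "b"], {"a": 1, "b": 1}, coll_total_terms=8))
```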

Pre-retrieval Algorithms

Averaged PMI measures the pointwise mutual information of two query terms in the collection, averaged over all pairs of query terms:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
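
A small sketch of Averaged PMI follows. It assumes all probabilities are document-level (the fraction of documents containing a term or a pair of terms), which the slides state only for the joint probability; it also assumes at least two distinct query terms and that every pair co-occurs in at least one document, since the logarithm is undefined otherwise. Names are hypothetical.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Mean PMI over all pairs of distinct query terms.

    docs is a list of token lists. Probabilities are fractions of
    documents (an assumption); a pair that never co-occurs would
    raise ValueError from math.log2(0).
    """
    doc_sets = [set(doc) for doc in docs]

    def prob(*terms):
        # fraction of documents containing all the given terms
        return sum(all(t in d for t in terms) for d in doc_sets) / len(doc_sets)

    pairs = list(combinations(sorted(set(query_terms)), 2))
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)

docs = [["a", "b"], ["a", "b"], ["c"], ["c"]]
print(averaged_pmi(["a", "b"], docs))
```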

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.


Word Queries

The most elementary query that can be formulated in a text retrieval system is the word.

Some models are also able to see the internal division of words into letters.

In this case, the alphabet is split into letters and separators.

A word is a sequence of letters surrounded by separators.

The division of the text into words is not arbitrary, since words carry a lot of meaning in natural language.

Word Queries

The result of a word query is the set of documents containing at least one of the words of the query.

Further, the resulting documents are ranked according to their degree of similarity with respect to the query.

To support ranking, two statistics on word occurrences inside texts are commonly used:

The first is called term frequency and counts the number of times a word appears inside a document.

The second is called inverse document frequency and is derived from the number of documents in which a word appears: the fewer the documents, the higher its value.
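
The two statistics can be sketched as follows. The logarithmic form log2(N / df) is one common choice and is an assumption here, as are the function names; documents are represented as lists of tokens.

```python
import math

def term_frequency(term, doc):
    """Number of times a word appears inside one document."""
    return doc.count(term)

def inverse_document_frequency(term, docs):
    """log2(N / df), where df is the number of documents containing
    the word; returns 0.0 for a word that appears nowhere."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / df) if df else 0.0

docs = [["web", "search", "web"], ["search"], ["query"], ["log"]]
print(term_frequency("web", docs[0]))
print(inverse_document_frequency("web", docs))
print(inverse_document_frequency("search", docs))
```

A word occurring in few documents ("web" above) gets a higher inverse document frequency than one occurring in many ("search").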

Word Queries

The other possibility for interpreting queries, popularized by Web search engines, is the conjunctive form.

In this case, a document matches a query only if it contains all the words in the query.

This is useful when the number of results for one single word is too large.

Additionally, the system may be required to provide the exact positions in which a word occurs in the text.

This might be useful, for instance, for highlighting word occurrences in snippets during the display of results.
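
Conjunctive matching with word positions can be sketched with a small positional inverted index. Tokenization by lowercasing and whitespace splitting is an assumption made for brevity, and the function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def conjunctive_match(index, query):
    """Ids of the documents that contain all the words of the query."""
    words = query.lower().split()
    doc_sets = [set(index[w]) if w in index else set() for w in words]
    return set.intersection(*doc_sets) if doc_sets else set()

docs = {1: "text retrieval systems", 2: "text mining"}
index = build_positional_index(docs)
print(conjunctive_match(index, "text retrieval"))
print(index["retrieval"][1])  # positions usable for snippet highlighting
```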

Context Queries

Many systems complement queries with the ability to search words in a given context.

Words which appear near each other may signal a higher likelihood of relevance than if they appear apart.

We may want to form phrases of words or find words which are proximal in the text.

Phrase:

Is a sequence of single-word queries.

An occurrence of the phrase is a sequence of words.

Can be ranked in a fashion somewhat analogous to single words.

Proximity:

Is a more relaxed version of the phrase query.

A maximum allowed distance between single words or phrases is given.

The ranking technique can depend on physical proximity.
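
Both kinds of context query can be sketched over a tokenized document. Measuring distance as the difference between word positions is an assumption (systems may count characters or words in between instead), and the function names are hypothetical.

```python
def phrase_positions(tokens, phrase):
    """Start positions where the phrase occurs as consecutive words."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within_distance(tokens, word_a, word_b, max_distance):
    """True if some occurrences of the two words are at most
    max_distance positions apart (in either order)."""
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

tokens = "enhance the retrieval of text by query expansion".split()
print(phrase_positions(tokens, ["query", "expansion"]))
print(within_distance(tokens, "retrieval", "text", 2))
```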

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators.

A boolean query has a syntax composed of:

atoms: basic queries that retrieve documents

boolean operators: work on their operands (which are sets of documents) and deliver sets of documents

This scheme is in general compositional: operators can be composed over the results of other operators.

Boolean Queries

A query syntax tree is naturally defined.

Consider the example of a query syntax tree below:

AND
  translation
  OR
    syntax
    syntactic

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic.

Boolean Queries

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

e1 OR e2: selects all documents which satisfy e1 or e2

e1 AND e2: selects all documents which satisfy both e1 and e2

e1 BUT e2: selects all documents which satisfy e1 but not e2

NOT e2: selects all documents which do not satisfy e2
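
Because the operands are sets of documents, the four operators map directly onto set operations. The sketch below evaluates a query syntax tree written as nested tuples; this tuple encoding, the function name, and the toy index are assumptions made for illustration.

```python
def evaluate(node, index, all_docs):
    """Evaluate a boolean query tree over sets of document ids.

    node is either a word (atom) or a tuple (operator, operand, ...);
    index maps each word to the set of documents containing it.
    """
    if isinstance(node, str):
        return index.get(node, set())      # atom: basic word query
    operator, *operands = node
    sets = [evaluate(o, index, all_docs) for o in operands]
    if operator == "OR":
        return sets[0] | sets[1]
    if operator == "AND":
        return sets[0] & sets[1]
    if operator == "BUT":
        return sets[0] - sets[1]
    if operator == "NOT":
        return all_docs - sets[0]          # complement within the collection
    raise ValueError(f"unknown operator: {operator}")

index = {"translation": {1, 2, 3}, "syntax": {1}, "syntactic": {3, 4}}
all_docs = {1, 2, 3, 4, 5}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(query, index, all_docs))
```

Note how NOT needs the whole collection as a parameter, whereas BUT only needs its two operands.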

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided.

A document either satisfies the boolean query or it does not.

This is quite a limitation because it does not allow for partial matching between a document and a user query.

To overcome this limitation, the condition for retrieval must be relaxed.

For instance, a document which partially satisfies an AND condition might be retrieved.

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection.

Boolean Queries

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents.

The documents are ranked higher when they have a larger number of elements in common with the query.

Query Languages

Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments which match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of allowed patterns, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
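
Most of the pattern types listed above can be sketched as predicates over a list of text words. Two assumptions are made here: errors are measured with Levenshtein edit distance, and ranges are inclusive at both ends. The function names and the `kind` labels are hypothetical.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def match_words(words, kind, pattern, extra=None):
    """Select the words matching one pattern type. `extra` is the upper
    bound for ranges and the error threshold for 'errors'."""
    tests = {
        "word": lambda w: w == pattern,
        "prefix": lambda w: w.startswith(pattern),
        "suffix": lambda w: w.endswith(pattern),
        "substring": lambda w: pattern in w,
        "range": lambda w: pattern <= w <= extra,
        "errors": lambda w: edit_distance(w, pattern) <= extra,
        "regex": lambda w: re.fullmatch(pattern, w) is not None,
    }
    return [w for w in words if tests[kind](w)]

words = ["compute", "computer", "flower", "text", "taxi", "held"]
print(match_words(words, "prefix", "comput"))
print(match_words(words, "range", "held", "taxi"))
print(match_words(words, "errors", "test", 1))
```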

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

Then the documents containing them are penalized in the ranking computation.

Under this scheme, we have completely eliminated any reference to boolean operations and entered into the field of natural language queries.

Query Languages

Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing content and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details in Chapter 13 on Structured Text Retrieval.

Structural Queries

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, with each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links, to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) next to its schematic view as a tree: a chapter node containing section nodes, which in turn contain title and figure nodes.]

Fixed Structure

An example of a query to the hierarchical structure presented:

IN
  figure
  WITH
    section
    WITH
      title
      "structural"

This parsed query returns the figure contained in the section whose title includes "structural".

Query Languages

Query Protocols

Query Protocols

Sometimes, query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

... where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (in % of queries) for these last two references.

Length  Dogpile (2005)  Yahoo (2007)
1       18              25
2       32              35
3       25              23
4       13              11
5        6               4
6        3               1
>7       3               1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
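
For reference, the usual form of Zipf's law can be written as below, where r is the rank of a query word by decreasing frequency and f(r) is its frequency. The symbols r, f, and C are notation assumed here; only α comes from the slides.

```latex
f(r) \approx \frac{C}{r^{\alpha}}
\qquad\Longleftrightarrow\qquad
\log f(r) \approx \log C - \alpha \log r
```

On a log-log plot this is a straight line of slope −α, so a larger α corresponds to a more skewed (more biased) distribution.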

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency (vertical axis, logarithmic scale from 1e-07 to 1) versus relative document frequency (horizontal axis, logarithmic scale from 1e-06 to 1) of each word in a vocabulary.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista  Excite  AlltheWeb  TodoCL
                         (1998)     (2001)  (2001)     (2002)
Words per query          2.4        2.6     2.3        1.1
Queries per user         2.0        2.3     2.9        -
Answer pages per query   1.3        1.7     2.2        1.2
Boolean queries          <40%       10%     -          16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties

User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine.

[Figure: state diagram whose node labels are Home Page, Advanced search, Directory, Web, URL, Results, Selected, Refining, and End session. Edges are labeled with transition probabilities, and the legend distinguishes three ranges: > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Before the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactions category with a broader category of resources.

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions

Early work used fixed limits of time to define sessionsbut this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence it is good to distinguish

time based sessions queries of a user in a same session

missions sequence of queries with the same goal

Research missions an additional problem is thatmissions can span more than one session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56

Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time

Different authors have found different thresholds ranging from fiveto sixty minutes

Query missions are a sequence of reformulatedqueries that express a same need

The detection of missions is a hard problem and hasbeen approached by several researchers

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57

Query PropertiesQuery Difficulty

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58

Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty

For example one word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval predictions mechanisms run the query andanalyze its answer set

Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59

Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand

Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents

The intuition is that the top ranked results of anunambiguous query will be topically cohesive

Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60

Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms

Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used

Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and

  • a posterior state in which the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
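The Query Feedback loop can be sketched as follows. The retrieval system and the query generator are passed in as hypothetical callables (`search`, `make_query`), since the slides do not specify how Q′ is built from L; the cut-off `k` is also an assumption.

```python
def query_feedback_score(query, search, make_query, k=10):
    """Overlap between the ranking L of `query` and the ranking L' of a query derived from L.

    search:     hypothetical callable, query -> ranked list of document ids
    make_query: hypothetical callable, ranked list -> new query Q'
    """
    first = search(query)[:k]        # ranked list L
    new_query = make_query(first)    # Q' generated from L
    second = search(new_query)[:k]   # ranked list L'
    if not first:
        return 0.0
    return len(set(first) & set(second)) / len(first)
```

A score close to 1 means the channel preserved the query's meaning, which this approach reads as a sign of an easy query.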

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

  • the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
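As an illustration of a frequency-based predictor, here is a minimal sketch of Averaged IDF. It assumes the common definition idf = log2(N / df) and ignores query terms that do not occur in the collection; both choices, and the function name, are assumptions rather than something the slides specify.

```python
import math

def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean inverse document frequency of the query terms found in the collection."""
    idfs = [
        math.log2(num_docs / doc_freq[term])
        for term in query_terms
        if doc_freq.get(term, 0) > 0
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0
```

Higher values indicate rarer query terms, which these predictors associate with better expected performance.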

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)$$

where

  • $P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
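The SCS formula translates almost directly into code. In this sketch the maximum likelihood estimate is taken as the count of the term in the query divided by the query length, and every query term is assumed to occur in the collection; the function and parameter names are hypothetical.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, coll_counts, coll_size):
    """SCS(q) = sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k)).

    coll_counts: term -> number of occurrences in the collection
    coll_size:   total number of term occurrences in the collection
    """
    query_counts = Counter(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)            # P_ml(k_i | q)
        p_coll = coll_counts[term] / coll_size     # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score
```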

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document, and $|(k_i, k_\ell)|$ is the number of query term pairs.
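A minimal sketch of Averaged PMI follows. It makes two assumptions that the slides leave open: all probabilities are estimated at the document level (the fraction of documents containing the term or the pair), and pairs that never co-occur are skipped to avoid taking the logarithm of zero.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over pairs of distinct query terms."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / n_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:  # assumption: ignore pairs that never co-occur
            values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0
```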

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering.

  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.


Word Queries

The result of a word query is the set of documents containing at least one of the words of the query.

Further, the resulting documents are ranked according to their degree of similarity with respect to the query.

To support ranking, two statistics on word occurrences inside texts are commonly used:

  • The first is called term frequency and counts the number of times a word appears inside a document.

  • The second is called inverse document frequency and is based on the number of documents in which a word appears (the fewer the documents, the higher its value).
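The two statistics can be combined into a simple ranking, sketched below. The tf × log(N/df) weighting is one common variant chosen here for illustration; the slides do not commit to a particular formula, and the function name is hypothetical.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """Rank the documents that contain at least one query word by summed tf * idf.

    docs: document id -> list of tokens
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        if not any(tf[t] for t in query_terms):
            continue  # the document contains none of the query words
        scores[doc_id] = sum(
            tf[t] * math.log(n_docs / df[t]) for t in query_terms if df[t]
        )
    return sorted(scores, key=scores.get, reverse=True)
```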

Word Queries

The other possibility for interpreting queries, popularized by Web search engines, is the conjunctive form.

In this case, a document matches a query only if it contains all the words in the query.

This is useful when the number of results for a single word is too large.

Additionally, it may be required that the exact positions in which a word occurs in the text be provided.

This might be useful for highlighting word occurrences in snippets, for instance, during the display of results.
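A minimal sketch of conjunctive matching over a positional index is given below. The index layout (word to document to list of positions) and the function names are assumptions made for illustration; the returned positions are what a snippet highlighter would need.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} for docs given as id -> token list."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index

def conjunctive_match(query_terms, index):
    """Return the documents containing all query words, with each word's positions."""
    postings = [set(index[t]) if t in index else set() for t in query_terms]
    matched = set.intersection(*postings) if postings else set()
    return {
        doc_id: {t: index[t][doc_id] for t in query_terms}
        for doc_id in sorted(matched)
    }
```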

Context Queries

Many systems complement queries with the ability to search for words in a given context.

Words which appear near each other may signal a higher likelihood of relevance than if they appear apart.

We may want to form phrases of words or find words which are proximal in the text.

Phrase

  • Is a sequence of single-word queries

  • An occurrence of the phrase is a sequence of words

  • Can be ranked in a fashion somewhat analogous to single words

Proximity

  • Is a more relaxed version of the phrase query

  • A maximum allowed distance between single words or phrases is given

  • The ranking technique can depend on physical proximity
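The difference between phrase and proximity queries can be sketched over a tokenized document. Measuring distance in word positions is an assumption (systems may also count characters), and the function names are hypothetical.

```python
def positions(tokens, word):
    """All positions at which `word` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]

def phrase_match(tokens, phrase):
    """True if the words of `phrase` occur consecutively and in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def proximity_match(tokens, word_a, word_b, max_distance):
    """True if the two words occur within `max_distance` positions of each other."""
    return any(
        abs(i - j) <= max_distance
        for i in positions(tokens, word_a)
        for j in positions(tokens, word_b)
    )
```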

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators.

A boolean query has a syntax composed of:

  • atoms: basic queries that retrieve documents

  • boolean operators: these work on their operands (which are sets of documents) and deliver sets of documents

This scheme is in general compositional: operators can be composed over the results of other operators.

Boolean Queries

A query syntax tree is naturally defined.

Consider the example of a query syntax tree below:

AND
  translation
  OR
    syntax
    syntactic

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic.

Boolean Queries

The most commonly used operators, given two basic queries or boolean sub-expressions e1 and e2, are:

  • e1 OR e2: selects all documents which satisfy e1 or e2

  • e1 AND e2: selects all documents which satisfy both e1 and e2

  • e1 BUT e2: selects all documents which satisfy e1 but not e2

  • NOT e2: selects all documents which do not contain e2
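Because operands and results are both sets of documents, a query syntax tree can be evaluated recursively with set operations. In this sketch a tree is either a word (an atom) or a hypothetical `(operator, left, right)` tuple, and the index maps each word to the set of documents containing it. NOT is omitted because it needs the whole collection as its universe.

```python
def evaluate(node, index):
    """Evaluate a boolean query tree against a word -> set-of-doc-ids index."""
    if isinstance(node, str):
        return set(index.get(node, ()))  # atom: documents containing the word
    operator, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if operator == "OR":
        return a | b
    if operator == "AND":
        return a & b
    if operator == "BUT":
        return a - b
    raise ValueError(f"unknown operator: {operator}")
```

For instance, the tree `("AND", "translation", ("OR", "syntax", "syntactic"))` corresponds to the query translation AND (syntax OR syntactic).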

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided.

A document either satisfies the boolean query or it does not.

This is quite a limitation because it does not allow for partial matching between a document and a user query.

To overcome this limitation, the condition for retrieval must be relaxed.

For instance, a document which partially satisfies an AND condition might be retrieved.

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection.

Boolean Queries

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents.

The documents are ranked higher when they have a larger number of elements in common with the query.

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments which match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

  • Words: a string which must be a word in the text

  • Prefixes: a string which must form the beginning of a text word

  • Suffixes: a string which must form the termination of a text word

  • Substrings: a string which can appear within a text word

  • Ranges: a pair of strings which matches any word which lexicographically lies between them

  • Allowing errors: a word together with an error threshold

  • Regular expressions: a rather general pattern built up from simple strings

  • Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
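Several of these pattern types reduce to one-line predicates over the words of a text. The sketch below covers words, prefixes, suffixes, substrings, ranges and regular expressions. The function name and the `kind` labels are hypothetical, and patterns allowing errors and extended patterns are left out.

```python
import re

def match_words(words, kind, pattern, upper=None):
    """Return the words matching `pattern` under the given pattern type.

    For kind == "range", `pattern` is the lower bound and `upper` the upper bound.
    """
    predicates = {
        "word": lambda w: w == pattern,
        "prefix": lambda w: w.startswith(pattern),
        "suffix": lambda w: w.endswith(pattern),
        "substring": lambda w: pattern in w,
        "range": lambda w: pattern <= w <= upper,
        "regex": lambda w: re.fullmatch(pattern, w) is not None,
    }
    accept = predicates[kind]
    return [w for w in words if accept(w)]
```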

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

Then the documents containing them are penalized in the ranking computation.

Under this scheme we have completely eliminated any reference to boolean operations and entered the field of natural language queries.

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details are given in Chapter 13 on Structured Text Retrieval.

Structural Queries

The three main types of structures are:

  a) form-like fixed structure
  b) hypertext structure
  c) hierarchical structure

[Figure: schematic views of the three structure types, labeled a), b) and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, with each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page (Chapter 6, "We cover in this chapter the different kind of ...", 6.1 Keyword Based, 6.3 Structured Queries) next to its schematic view, a tree rooted at chapter with section, title and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented:

[Figure: query syntax tree for figure IN (section WITH (title WITH "structural"))]

This parsed query returns the figure contained in the section whose title includes the word "structural".

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

  • Z39.50

  • Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

  • Common Command Language (CCL)

  • Compact Disk Read only Data exchange (CD-RDx)

  • Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wildcards and repetition.

For example:

where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most Web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

  • query occurrences

  • term occurrences

  • query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18%              25%
2        32%              35%
3        25%              23%
4        13%              11%
5        6%               4%
6        3%               1%
>7       3%               1%

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary; logarithmic axes, x-axis: Document frequency, y-axis: Query frequency]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                -
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             -                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine.

[Figure: state diagram with nodes labeled Home Page, Advanced search, Directory, Selected, Results, Refining, URL, Web and End session; edges carry transition probabilities]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

  • Informational: the intent is to acquire some information (39% survey, 48% query log)

  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • Anchor-text distribution of the words in the queries

  • Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified into more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or

  • the rich and complex information need of studying meteorology.

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%):

Rank   Topic                                       1997   2001
1      Commerce, travel, employment or economy     13.3   24.7
2      People, places or things                     6.7   19.7
3      Non-English or unknown                       4.1   11.3
4      Computers or Internet                       12.5    9.6
5      Sex or pornography                          16.8    8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                 19.9    6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity or religion      5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: F-score of about 45 on 63 categories.

The table below shows the results for five queries.

Query                   Top category              Second category
chat rooms              Computers/Internet        Online Community/Chat
lake michigan lodges    Info/Local & Regional     Living/Travel & Vacation
stephen hawking         Info/Science & Tech       Info/Arts & Humanities
dog shampoo             Shopping/Buying Guides    Living/Pets & Animals
text mining             Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
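The final voting step can be illustrated with a plain majority vote over the categories assigned to a query's search results. This is only an assumption about the general shape of such a scheme; the slides do not describe the actual voting algorithm of Broder et al., and the function name is hypothetical.

```python
from collections import Counter

def classify_by_voting(result_labels):
    """Assign the query the category predicted most often for its search results."""
    if not result_labels:
        return None
    return Counter(result_labels).most_common(1)[0][0]
```

For example, if the classifier labels three results as Sports, News and Sports, the query would be assigned to Sports.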

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit

  • the user might have more than one goal in the session

Hence, it is useful to distinguish:

  • time-based sessions: queries of a user in the same session

  • missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.
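Splitting a user's query stream by maximum inactivity time can be sketched as follows. The 30-minute default is an arbitrary choice within the five-to-sixty-minute range mentioned above, timestamps are assumed to be in seconds, and the function name is hypothetical.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group one user's query timestamps into sessions separated by long inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(ts)   # still within the inactivity threshold
        else:
            sessions.append([ts])     # gap too long: start a new session
    return sessions
```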



Word Queries

The other way of interpreting queries, popularized by Web search engines, is the conjunctive form

In this case, a document matches a query only if it contains all the words in the query

This is useful when the number of results for a single word is too large

Additionally, it may be required that the exact positions in which a word occurs in the text be provided

This might be useful, for instance, for highlighting word occurrences in snippets during the display of results
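The conjunctive interpretation, together with word positions, can be sketched with a small positional inverted index. This is only an illustration under simple assumptions (whitespace tokenization, lowercase matching); the function names and the sample documents are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}, where positions are word offsets."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def conjunctive_match(index, query):
    """Return {doc_id: {word: positions}} for documents containing ALL query words."""
    words = query.lower().split()
    if not words:
        return {}
    # Intersect the document sets of every query word.
    doc_ids = set(index.get(words[0], {}))
    for word in words[1:]:
        doc_ids &= set(index.get(word, {}))
    return {d: {w: index[w][d] for w in words} for d in sorted(doc_ids)}

# Hypothetical sample collection.
docs = {
    1: "query languages for text retrieval",
    2: "text retrieval systems rank documents",
    3: "boolean query languages",
}
index = build_positional_index(docs)
hits = conjunctive_match(index, "text retrieval")
```

The positions returned with each hit are what a result page would need in order to highlight the query words in a snippet.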

Context Queries

Many systems complement queries with the ability to search for words in a given context

Words which appear near each other may signal a higher likelihood of relevance than if they appear apart

We may want to form phrases of words or find words which are proximal in the text

Phrase

Is a sequence of single-word queries
An occurrence of the phrase is a sequence of words
Can be ranked in a fashion somewhat analogous to single words

Proximity

Is a more relaxed version of the phrase query
A maximum allowed distance between single words or phrases is given
The ranking technique can depend on physical proximity
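The difference between phrase and proximity queries can be illustrated on a tokenized text. The sketch below assumes that distance is measured in word positions and that word order does not matter for proximity; both choices, and the function names, are assumptions made for the example.

```python
def positions(tokens, word):
    """Word offsets at which `word` occurs."""
    return [i for i, t in enumerate(tokens) if t == word]

def phrase_match(tokens, phrase):
    """True if the words of `phrase` occur consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def proximity_match(tokens, w1, w2, max_distance):
    """True if w1 and w2 occur at most `max_distance` positions apart, in either order."""
    return any(abs(i - j) <= max_distance
               for i in positions(tokens, w1)
               for j in positions(tokens, w2))

tokens = "enhance the retrieval of structured text".split()
```

With this text, the phrase "structured text" matches, while "retrieval" and "text" only match as a proximity query with a maximum distance of at least 3.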

Boolean Queries

The oldest way to combine keyword queries is to use boolean operators

A boolean query has a syntax composed of

atoms: basic queries that retrieve documents

boolean operators: work on their operands (which are sets of documents) and deliver sets of documents

This scheme is, in general, compositional: operators can be composed over the results of other operators

Boolean Queries

A query syntax tree is naturally defined

Consider the example of a query syntax tree below

            AND
           /   \
 translation    OR
               /  \
          syntax    syntactic

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic

Boolean Queries

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are

e1 OR e2: the query selects all documents which satisfy e1 or e2

e1 AND e2: selects all documents which satisfy both e1 and e2

e1 BUT e2: selects all documents which satisfy e1 but not e2

NOT e2: the query selects all documents which do not contain e2
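Since the operators work on sets of documents, a query syntax tree can be evaluated recursively with ordinary set operations. The sketch below represents a tree as nested tuples, which is an assumption of this example and not a notation used in the slides; the tiny index is hypothetical.

```python
def evaluate(node, index, all_docs):
    """Evaluate a query tree; a node is a word (atom) or a tuple (operator, operands...)."""
    if isinstance(node, str):
        return set(index.get(node, set()))
    op, *args = node
    sets = [evaluate(a, index, all_docs) for a in args]
    if op == "OR":
        return sets[0] | sets[1]
    if op == "AND":
        return sets[0] & sets[1]
    if op == "BUT":
        return sets[0] - sets[1]
    if op == "NOT":
        # The complement needs the whole collection, which is why NOT is rarely used alone.
        return all_docs - sets[0]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical index: word -> set of document ids.
index = {"translation": {1, 2, 4}, "syntax": {2, 3}, "syntactic": {4, 5}}
all_docs = {1, 2, 3, 4, 5}
tree = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(tree, index, all_docs)
```

Here the tree is the query translation AND (syntax OR syntactic), and BUT is simply set difference.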

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided

A document either satisfies the boolean query or it does not

This is quite a limitation because it does not allow for partial matching between a document and a user query

To overcome this limitation, the condition for retrieval must be relaxed

For instance, a document which partially satisfies an AND condition might be retrieved

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection

Boolean Queries

A fuzzy-boolean set of operators has been proposed

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents

The documents are ranked higher when they have a larger number of elements in common with the query
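The ranking idea stated above can be sketched as follows. This is not the fuzzy-boolean model from the literature, only a minimal illustration, assuming that the "elements in common" are distinct query words and that a document is retrieved as soon as it contains at least one of them.

```python
def relaxed_rank(docs, query_words):
    """Rank documents by how many distinct query words they contain (at least one)."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = sum(1 for w in set(query_words) if w in words)
        if matched > 0:
            scored.append((doc_id, matched))
    # More matched words first; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample collection.
docs = {
    1: "syntax of query languages",
    2: "translation and syntax",
    3: "image retrieval",
}
ranking = relaxed_rank(docs, ["translation", "syntax"])
```

Document 2 satisfies the full conjunction and is ranked first, while document 1, which only partially satisfies it, is still retrieved.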

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment

Those segments satisfying the pattern specifications are said to match the pattern

We can search for documents containing segments which match a given search pattern

Each system allows specifying some types of patterns

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate

Pattern Matching

The most used types of patterns are

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
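Several of these pattern types can be tried against a small vocabulary. The sketch below is illustrative only: the vocabulary and function names are hypothetical, ranges are taken as inclusive, and the error threshold is assumed to be an edit (Levenshtein) distance, which is one common way of allowing errors.

```python
import re

# Hypothetical vocabulary of text words.
vocabulary = ["data", "database", "retrieval", "retrieve", "text", "texture", "context"]

def prefix(p):
    return [w for w in vocabulary if w.startswith(p)]

def suffix(s):
    return [w for w in vocabulary if w.endswith(s)]

def substring(s):
    return [w for w in vocabulary if s in w]

def word_range(low, high):
    """Words lying lexicographically between low and high (inclusive)."""
    return [w for w in vocabulary if low <= w <= high]

def regex(pattern):
    return [w for w in vocabulary if re.fullmatch(pattern, w)]

def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def allowing_errors(word, k):
    """Words within edit distance k of `word`."""
    return [w for w in vocabulary if edit_distance(word, w) <= k]
```

In a real system these searches would run over the index vocabulary rather than a Python list, but the matching conditions are the same.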

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred

In this case, a query becomes simply an enumeration of words and context queries

Negation can be handled by letting the user express that some words are not desired

Then the documents containing them are penalized in the ranking computation

Under this scheme, we have completely eliminated any reference to boolean operations and entered into the field of natural language queries

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them

The standardization of languages to represent structured texts has pushed forward in this direction

Mixing contents and structure in queries allows posing very powerful queries

Queries can be expressed using containment, proximity or other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Structural Queries

The three main types of structures

a) form-like fixed structure
b) hypertext structure
c) hierarchical structure

[Figure: schematic views of structures a), b) and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive

The documents had a fixed set of fields, and each field had some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

Fields were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field
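Restricting a pattern to a field can be sketched by modelling each document as a flat mapping from field names to text. The field names and documents below are hypothetical; the point is only that fields do not nest and that a document lacking the field never matches.

```python
def search_field(documents, field, word):
    """Return ids of documents whose `field` contains `word`; a missing field never matches."""
    word = word.lower()
    return [doc_id for doc_id, fields in documents.items()
            if word in fields.get(field, "").lower().split()]

# Hypothetical documents with a fixed set of possible fields (title, body).
documents = {
    1: {"title": "Query languages", "body": "keyword based querying"},
    2: {"title": "Image search", "body": "query logs"},
    3: {"body": "languages"},
}
```

Searching for "query" in the title field returns only document 1, although document 2 contains the word in its body.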

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table

There are several proposals that extend SQL to allow full-text retrieval

Among them, we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power

Retrieval from hypertext began as a merely navigational activity

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted

Some query tools allow querying hypertexts based on their content and their structure

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure

An example of a hierarchical structure: the page of a book and its schematic view

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) and its schematic view as a tree of chapter, section, title and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented

figure IN (section WITH (title WITH “structural”))

[Figure: the image returned by this parsed query]
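A query of this kind can be evaluated over a small tree of structural elements. The sketch below is an assumption-laden illustration: it reads the query as "figures inside sections that contain a title with the given word", uses a hypothetical Node class, and assumes sections are not nested inside each other.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: list = field(default_factory=list)

def descendants(node):
    """Yield every node below `node` in the hierarchy."""
    for child in node.children:
        yield child
        yield from descendants(child)

def figures_in_sections_titled(root, word):
    """Figures contained in sections that also contain a title mentioning `word`."""
    word = word.lower()
    results = []
    for section in [root, *descendants(root)]:
        if section.kind != "section":
            continue
        inside = list(descendants(section))
        if any(n.kind == "title" and word in n.text.lower() for n in inside):
            results.extend(n for n in inside if n.kind == "figure")
    return results

# Hypothetical document tree loosely following the book-page example.
chapter = Node("chapter", children=[
    Node("section", children=[Node("title", "Keyword Based")]),
    Node("section", children=[Node("title", "Structural Queries"), Node("figure", "fig-1")]),
])
found = figures_in_sections_titled(chapter, "structural")
```

Only the second section has a title containing "structural", so only its figure is returned.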

Query Languages: Query Protocols

Query Protocols

Sometimes, query languages are used by applications to query text databases

Because they are not intended for human use, we refer to them as protocols rather than languages

The most important are/were

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols

The main goal of these protocols is to provide disc interchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture

The language does not define any specific formatting or markup

Query Protocols

For example, a query in SFQL is

Select abstract from journalpapers
where title contains text search

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition

For example

where paper contains retrieval
or like info and date > 1198

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs

People have different search needs at different times and in different contexts

A number of researchers have attempted to classify and tally the types of information needs

Characterizing Web Queries

Most web search engines record information about the queries

This information includes the query itself, the time, and the IP address

Some systems also record which search results were clicked on for a given query

These logs are a valuable resource for understanding the kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length
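The three basic statistics can be computed from a list of logged query strings in a few lines. This is a minimal sketch, assuming the log has already been reduced to one query per entry and that terms are separated by whitespace; the sample log is invented for the example.

```python
from collections import Counter

def basic_statistics(queries):
    """Return query occurrences, term occurrences and the mean query length in terms."""
    query_counts = Counter(queries)
    term_counts = Counter(term for q in queries for term in q.split())
    lengths = [len(q.split()) for q in queries]
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return query_counts, term_counts, mean_length

# Hypothetical query log.
log = ["weather", "weather madrid", "text mining", "weather"]
query_counts, term_counts, mean_length = basic_statistics(log)
```

Sorting either counter by frequency gives the rank-frequency data needed to study the distribution of queries and terms.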

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (% of queries) for these last two references

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1

Characterizing Web Queries

Queries, as words in a text, follow a biased distribution

In fact, the frequency of query words follows a Zipf's law with parameter α

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences

However, this is less biased than Web text, where α is closer to 2

The standard correlation between the frequency of a word in Web pages and in queries also varies

These values range from 0.15 to 0.42
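Given the frequencies of query words, the parameter α can be estimated by fitting a straight line in log-log space, since Zipf's law states that frequency is proportional to 1 / rank^α. The sketch below uses a plain least-squares fit, which is one simple estimator among several and is an assumption of this example rather than the method used in the cited studies.

```python
import math

def zipf_alpha(frequencies):
    """Estimate alpha by a least-squares fit of log(freq) = c - alpha * log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)  # needs at least two distinct ranks
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic frequencies that follow Zipf's law exactly with alpha = 1.2.
synthetic = [1000 / rank ** 1.2 for rank in range(1, 51)]
alpha = zipf_alpha(synthetic)
```

On real logs the fit is usually restricted to a range of ranks, because the head and the tail of the distribution tend to deviate from a straight line.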

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary; log-log plot with document frequency on the x-axis and query frequency on the y-axis]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search

Many people refine the query by adding and removing words, but most of them see very few answer pages

The table below shows query statistics for four different search engines

Measure                   AltaVista   Excite   AlltheWeb   TodoCL
                          (1998)      (2001)   (2001)      (2002)
Words per query           2.4         2.6      2.3         1.1
Queries per user          2.0         2.3      2.9         –
Answer pages per query    1.3         1.7      2.2         1.2
Boolean queries           <40%        10%      –           16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query)

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics)

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine; nodes labeled Home Page, Advanced search, Directory, Results, Selected, Refining, End session, URL and Web, with transition probabilities on the edges]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted at helping the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Query Intent

Broder's taxonomy of Web search goals

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Query Intent

Rose & Levinson developed a taxonomy that extended and somewhat changed Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus, they replace Broder's transactions category with a broader category of resources

Query Intent

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be classified in more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                        1997   2001
1      Commerce, travel, employment or economy      13.3   24.7
2      People, places or things                      6.7   19.7
3      Non-English or unknown                        4.1   11.3
4      Computers or Internet                        12.5    9.6
5      Sex or pornography                           16.8    8.5
6      Health or sciences                            9.5    7.5
7      Entertainment or recreation                  19.9    6.6
8      Education or humanities                       5.6    4.5
9      Society, culture, ethnicity or religion       5.4    3.9
10     Government                                    3.4    2.0
11     Performing or fine arts                       5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: F-score of about 45% on 63 categories

The table below shows the results for five queries

Query                   Top category               Second category
chat rooms              Computers/Internet         Online Community/Chat
lake michigan lodges    Info/Local & Regional      Living/Travel & Vacation
stephen hawking         Info/Science & Tech        Info/Arts & Humanities
dog shampoo             Shopping/Buying Guides     Living/Pets & Animals
text mining             Computers/Software         Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They classified the results of a query with the classifier and then used a voting algorithm to classify the query
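The general idea of classifying a query through its results can be sketched as follows. The slides only say that a voting algorithm is used; the plain majority vote below, the toy keyword classifier and the sample results are all assumptions made for illustration and are not the method of Broder et al.

```python
from collections import Counter

def classify_query_by_voting(results, classify_document):
    """Classify each retrieved document, then return the category with the most votes."""
    votes = Counter(classify_document(doc) for doc in results)
    return votes.most_common(1)[0][0] if votes else None

def toy_classifier(doc):
    # Hypothetical stand-in for a classifier trained on a large commercial taxonomy.
    return "Pets" if "dog" in doc or "pet" in doc else "Shopping"

# Hypothetical search results (e.g. snippets) for the query "dog shampoo".
results = ["dog shampoo reviews", "pet grooming tips", "buy shampoo online"]
category = classify_query_by_voting(results, toy_classifier)
```

The appeal of this scheme for rare queries is that the retrieved documents supply far more text than the short query itself.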

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed limits of time to define sessions, but this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence, it is good to distinguish

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

A query mission is a sequence of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
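Splitting the queries of one user into sessions with a maximum inactivity time is straightforward once the query timestamps are known. The sketch below assumes timestamps expressed in seconds and a hypothetical function name; the threshold is left as a parameter, since the reported values range from five to sixty minutes.

```python
def split_sessions(timestamps, max_inactivity):
    """Group one user's query timestamps into sessions separated by long pauses."""
    sessions = []
    for t in sorted(timestamps):
        # Continue the current session if the pause since the last query is short enough.
        if sessions and t - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

# Hypothetical timestamps in seconds; 1800 seconds is a 30-minute threshold.
timestamps = [0, 60, 200, 4000, 4100]
sessions = split_sessions(timestamps, 1800)
```

Detecting missions is harder because it requires comparing the content of the queries, not only the time between them.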

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval mechanisms: run the query and analyze its answer set

Pre-retrieval mechanisms: evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top ranked results of an unambiguous query will be topically cohesive

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution
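The intuition behind Clarity Score can be sketched by comparing two unigram language models. The example below assumes that the difference is measured as the Kullback-Leibler divergence, which is the usual definition of the score, and uses unsmoothed maximum likelihood models for simplicity; it also assumes that the top documents belong to the collection, so every word has a non-zero collection probability.

```python
import math
from collections import Counter

def language_model(texts):
    """Unsmoothed maximum likelihood unigram model: word -> probability."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clarity(top_docs, collection):
    """KL divergence (in bits) between the top-documents model and the collection model."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[w]) for w, p in p_top.items())

# Hypothetical collection of four short documents.
collection = [
    "jaguar car speed",
    "jaguar cat jungle",
    "stock market news",
    "football match score",
]
focused = clarity(collection[:1], collection)   # topically cohesive answer set
diffuse = clarity(collection, collection)       # answer set that mirrors the collection
```

A cohesive answer set yields a high score, while an answer set whose term distribution matches the collection yields a score of zero.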

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy
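Both ideas rest on measuring how much several ranked lists agree at the top. The sketch below computes the average top-k overlap over all pairs of lists; the choice of a set-based overlap, the cutoff k and the function names are assumptions of this example, and the cited works use their own, more refined measures.

```python
def top_k_overlap(list_a, list_b, k):
    """Fraction of documents shared by the top-k of two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

def mean_pairwise_overlap(ranked_lists, k):
    """Average top-k overlap over all pairs of ranked lists (needs at least two lists)."""
    pairs = [(a, b) for i, a in enumerate(ranked_lists) for b in ranked_lists[i + 1:]]
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)

# Hypothetical ranked lists of document ids produced by three ranking functions.
easy = [[1, 2, 3, 4], [2, 1, 3, 5], [1, 3, 2, 6]]
hard = [[1, 2, 3], [4, 5, 6], [1, 7, 8]]
```

A high average overlap suggests an easy query, and a low one suggests a difficult query.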

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′

The overlap between L and L′ is used as the prediction score

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they either take into account

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency

Queries with low frequency terms are predicted to achieve a better performance
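Averaged IDF is the simplest of these predictors to sketch. The example below assumes the common form idf = log2(N / n), where N is the number of documents and n is the number of documents containing the term; other idf variants exist, and the statistics used here are invented.

```python
import math

def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean inverse document frequency of the query terms, using idf = log2(N / n)."""
    idfs = [math.log2(num_docs / doc_freq[t]) for t in query_terms]
    return sum(idfs) / len(idfs)

# Hypothetical document frequencies in a collection of 1024 documents.
doc_freq = {"the": 1024, "retrieval": 16, "zipf": 1}
rare = averaged_idf(["retrieval", "zipf"], doc_freq, 1024)
common = averaged_idf(["the", "retrieval"], doc_freq, 1024)
```

The query made of low frequency terms gets the higher score, which matches the prediction that such queries achieve better performance.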

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
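The SCS formula translates almost directly into code. The sketch below assumes that the maximum likelihood estimate of a term given the query is its count in the query divided by the query length, and that every query term occurs in the collection; the counts used are invented.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, collection_counts, collection_size):
    """SCS(q): sum over distinct query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k))."""
    score = 0.0
    for term, count in Counter(query_terms).items():
        p_ml = count / len(query_terms)                      # P_ml(k_i | q)
        p_coll = collection_counts[term] / collection_size   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Hypothetical term counts in a collection of 1024 term occurrences.
collection_counts = {"text": 64, "mining": 16}
scs = simplified_clarity_score(["text", "mining"], collection_counts, 1024)
```

Only collection-wide term counts are needed, so the score can be computed without running the query.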

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
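Averaged PMI can be sketched over a small collection as follows. For simplicity, this example estimates all probabilities at the document level (fraction of documents containing a term or a pair of terms), which is an assumption of the sketch, and it requires every pair of query terms to co-occur in at least one document, since the logarithm of zero is undefined.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average of log2(P(a, b) / (P(a) * P(b))) over all pairs of distinct query terms."""
    doc_sets = [set(d.lower().split()) for d in docs]
    n = len(doc_sets)

    def p(*terms):
        # Fraction of documents containing all the given terms.
        return sum(1 for s in doc_sets if all(t in s for t in terms)) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    return sum(math.log2(p(a, b) / (p(a) * p(b))) for a, b in pairs) / len(pairs)

# Hypothetical collections: in the first the two terms are independent,
# in the second they always appear together.
independent = ["text mining", "text retrieval", "data mining", "image search"]
associated = ["text mining", "text mining", "image search", "image search"]
```

A value of zero means the terms co-occur as often as expected by chance, while positive values indicate terms that tend to appear together.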

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

However, neither technique is efficient enough to be useful for large collections


Context QueriesMany systems complement queries with the ability tosearch words in a given context

Words which appear near each other may signal higherlikelihood of relevance than if they appear apart

We may want to form phrases of words or find wordswhich are proximal in the text

Phrase

Is a sequence of single-word queriesAn occurrence of the phrase is a sequence of wordsCan be ranked in a fashion somewhat analogous to single words

Proximity

Is a more relaxed version of the phrase queryA maximum allowed distance between single words or phrases is givenThe ranking technique can be depend on physical proximity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 13

Boolean QueriesThe oldest way to combine keyword queries is to useboolean operators

A boolean query has a syntax composed of

atoms basic queries that retrieve documents

boolean operators work on their operands (which are sets ofdocuments) and deliver sets of documents

This scheme is in general compositional operatorscan be composed over the results of other operators

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 14

Boolean QueriesA query syntax tree is naturally defined

Consider the example of a query syntax tree below

syntax syntactic

ORtranslation

AND

It will retrieve all the documents which contain the wordtranslation as well as either the word syntax or theword syntactic

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 15

Boolean QueriesThe operators most commonly used given two basicqueries or boolean sub-expressions e1 and e2 are

e1 OR e2 the query selects all documents which satisfy e1 or e2

e1 AND e2 selects all documents which satisfy both e1 and e2

e1 BUT e2 selects all documents which satisfy e1 but not e2

NOT e2 the query selects all documents which not contain e2

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 16

Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided

A document either satisfies the boolean query or it does not

This is quite a limitation because it does not allow forpartial matching between a document and a user query

To overcome this limitation the condition for retrievalmust be relaxed

For instance a document which partially satisfies an ANDcondition might be retrieved

The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17

Boolean QueriesA fuzzy-boolean set of operators has been proposed

The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents

The documents are ranked higher when they have alarger number of elements in common with the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18

Query LanguagesBeyond Keywords

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19

Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment

Those segments satisfying the pattern specificationsare said to match the pattern

We can search for documents containing segmentswhich match a given search pattern

Each system allows specifying some types of patterns

The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20

Pattern MatchingThe most used types of patterns are

Words a string which must be a word in the text

Prefixes a string which must form the beginning of a text word

Suffixes a string which must form the termination of a text word

Substrings a string which can appear within a text word

Ranges a pair of strings which matches any word whichlexicographically lies between them

Allowing errors a word together with an error threshold

Regular expressions a rather general pattern built up by simplestrings

Extended patterns a more user-friendly query language torepresent some common cases of regular expressions

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21

Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred

In this case a query becomes simply an enumeration ofwords and context queries

The negation can be handled by letting the userexpress that some words are not desired

Then the documents containing them are penalized in the rankingcomputation

Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22

Query LanguagesStructural Queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23

Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Structural Queries

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

[Figure: schematic drawings of the three structure types a), b) and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive

The documents had a fixed set of fields, and each field had some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

Fields were not allowed to nest or overlap

Retrieval allowed specifying that a given basic pattern was to be found only in a given field

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table

There are several proposals that extend SQL to allow full-text retrieval

Among them, we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power

Retrieval from hypertext began as a merely navigational activity

That is, the user had to manually traverse the hypertext nodes, following links, to search for what he/she wanted

Some query tools allow querying hypertext based on its content and its structure

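Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure

An example of a hierarchical structure: the page of a book and its schematic view

[Figure: a book page headed "Chapter 6", with the text "We cover in this chapter the different kind of" and the section headings "6.1 Keyword Based" and "6.3 Structured Queries", next to its schematic view as a tree: a chapter node above two section nodes, with title, title and figure nodes below them and the leaf texts "We cover", "Keyword Based" and "Structural"]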

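Fixed Structure

An example of a query to the hierarchical structure presented:

figure IN (section WITH (title WITH "structural"))

[Figure: the parsed query drawn as a tree, with IN at the root over figure and a WITH node, which in turn covers section and a second WITH node over title and "structural"]

This parsed query returns the figure contained in the section whose title contains "structural"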
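
To make the semantics of such a query concrete, here is a small sketch that evaluates it over an in-memory tree. The `Node` class, the helper names, and the reading of WITH ("contains a descendant that satisfies the condition") and IN ("is contained in") are assumptions chosen for illustration, and the example tree only loosely mirrors the book page of the previous slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node strictly below `node` in the hierarchy."""
    for child in node.children:
        yield child
        yield from descendants(child)


def figures_in_sections_titled(root, word):
    """Evaluate: figure IN (section WITH (title WITH word))."""
    # section WITH (title WITH word): sections that contain a title mentioning the word.
    sections = [
        n for n in [root, *descendants(root)]
        if n.kind == "section"
        and any(d.kind == "title" and word in d.text.lower() for d in descendants(n))
    ]
    # figure IN (...): figures contained in one of those sections, without repeats.
    found = []
    for section in sections:
        for d in descendants(section):
            if d.kind == "figure" and not any(d is f for f in found):
                found.append(d)
    return found


# Hypothetical document tree.
book = Node("chapter", children=[
    Node("section", children=[Node("title", "Keyword Based Querying")]),
    Node("section", children=[
        Node("title", "Structural Queries"),
        Node("figure", "types of structure"),
    ]),
])

hits = [f.text for f in figures_in_sections_titled(book, "structural")]
```

Only the second section has a title containing "structural", so only its figure is returned.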

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases

Because they are not intended for human use, we refer to them as protocols rather than languages

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols

The main goal of these protocols is to provide disc interchangeability

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture

The language does not define any specific formatting or markup

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition

For example:

... where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties

Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs

People have different search needs at different times and in different contexts

A number of researchers have attempted to classify and tally the types of information needs

Characterizing Web Queries

Most web search engines record information about the queries

This information includes the query itself, the time, and the IP address

Some systems also record which search results were clicked on for a given query

These logs are a valuable resource for understanding the kinds of information needs that users have

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length
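
These three statistics are straightforward to compute once the log is reduced to a list of query strings. The following is a minimal sketch using Python's standard `collections.Counter`; the function name, the whitespace tokenization, and the five-query log are assumptions made for illustration.

```python
from collections import Counter


def basic_stats(queries):
    """Return (query occurrences, term occurrences, mean query length in terms)."""
    query_counts = Counter(queries)
    term_counts = Counter(term for q in queries for term in q.split())
    mean_length = sum(len(q.split()) for q in queries) / len(queries)
    return query_counts, term_counts, mean_length


# Hypothetical log, already stripped of timestamps and IP addresses.
log = [
    "cheap flights",
    "weather",
    "cheap hotels",
    "weather",
    "weather forecast london",
]

query_counts, term_counts, mean_length = basic_stats(log)
```

Real logs would first need normalization (case folding, trimming, removal of automated traffic), which is omitted here.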

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths for these last two references

Length   Dogpile (2005) (%)   Yahoo (2007) (%)
1        18                   25
2        32                   35
3        25                   23
4        13                   11
5        6                    4
6        3                    1
>7       3                    1

Characterizing Web Queries

Queries, as words in a text, follow a biased distribution

In fact, the frequency of query words follows a Zipf's law with parameter α

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences

However, this is less biased than Web text, where α is closer to 2

The standard correlation between the frequency of a word in Web pages and in queries also varies

These values range from 0.15 to 0.42
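
Under Zipf's law the frequency of the word of rank r is proportional to 1/r^α, so log-frequency is linear in log-rank with slope -α. The sketch below estimates α with an ordinary least-squares fit on the log-log values. This is a simple estimator chosen for illustration, not necessarily the one used in the studies cited; the function name and the synthetic data are assumptions.

```python
import math


def zipf_alpha(frequencies):
    """Estimate the Zipf parameter as minus the slope of log(frequency) vs log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -slope


# Synthetic frequencies that follow a power law with alpha = 1.2 exactly.
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 201)]
estimated_alpha = zipf_alpha(synthetic)
```

On real query logs the tail of rare words is noisy, so the fitted value depends on how many ranks are included.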

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web

Relative query frequency vs. relative document frequency of each word in a vocabulary

[Figure: scatter plot on logarithmic axes; x-axis: document frequency (1e-06 to 1), y-axis: query frequency (1e-07 to 1)]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search

Many people refine the query by adding and removing words, but most of them see very few answer pages

The table below shows query statistics for four different search engines

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query)

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics)

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine

[Figure: state diagram whose labels include Home Page, Advanced search, Directory, Selected, Results, Refining, End session, URL and Web; arrows carry transition probabilities and are drawn in three styles for the ranges > 0.2, 0.2 >= x > 0.05 and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted to help the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus, they replace Broder's transactions category with a broader category of resources

Query Intent

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be assigned to more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                        1997   2001
1      Commerce, travel, employment, or economy     13.3   24.7
2      People, places, or things                    6.7    19.7
3      Non-English or unknown                       4.1    11.3
4      Computers or Internet                        12.5   9.6
5      Sex or pornography                           16.8   8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                  19.9   6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: F-score of about 45% on 63 categories

The table below shows the results for five queries

Query                  Top category              Second category
chat rooms             Computers/Internet        Online Community/Chat
lake michigan lodges   Info/Local & Regional     Living/Travel & Vacation
stephen hawking        Info/Science & Tech       Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides    Living/Pets & Animals
text mining            Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They ran the search results of a query through the classifier and then used a voting algorithm to classify the query
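
The idea of classifying a query through its search results can be sketched with a plain majority vote. This is only an illustration: the voting scheme used by Broder et al. is not described on the slide, and `classify_query`, the keyword-based `toy_classifier`, and the example results are all hypothetical.

```python
from collections import Counter


def classify_query(result_texts, classify_document, top_n=1):
    """Label a query with the categories most often assigned to its search results."""
    votes = Counter(classify_document(text) for text in result_texts)
    return [category for category, _ in votes.most_common(top_n)]


def toy_classifier(text):
    """Hypothetical stand-in for a document classifier trained on a taxonomy."""
    return "Shopping" if "buy" in text.lower().split() else "Pets"


# Hypothetical result snippets retrieved for the query "dog shampoo".
results = [
    "buy dog shampoo online",
    "dog grooming tips",
    "shampoo for dogs buy now",
]

labels = classify_query(results, toy_classifier)
```

Because the vote is taken over retrieved documents rather than over the query text, even a two-word query gets enough evidence for a decision.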

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the fixed limit

the user might have more than one goal in the session

Hence, it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are a sequence of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
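
Splitting one user's queries into sessions with a maximum inactivity time can be sketched as follows. The function name, the event format, and the 30-minute default are assumptions; the default simply falls inside the five-to-sixty-minute range mentioned above.

```python
def split_sessions(events, max_gap_seconds=30 * 60):
    """Group (timestamp_in_seconds, query) pairs of one user, sorted by time, into sessions.

    A new session starts whenever the gap since the previous query exceeds the threshold.
    """
    sessions = []
    for timestamp, query in events:
        if sessions and timestamp - sessions[-1][-1][0] <= max_gap_seconds:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions


# Hypothetical events for a single user.
events = [(0, "jaguar"), (600, "jaguar car"), (700, "jaguar car price")]

five_minutes = split_sessions(events, max_gap_seconds=5 * 60)
sixty_minutes = split_sessions(events, max_gap_seconds=60 * 60)
```

The same three queries form two sessions with a five-minute threshold and one session with a sixty-minute threshold, which shows how strongly the result depends on the chosen value. Detecting missions would additionally require comparing the queries themselves, which this sketch does not attempt.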

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top ranked results of an unambiguous query will be topically cohesive

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution
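
The intuition can be made concrete with a simplified sketch that measures the Kullback-Leibler divergence between a unigram model of the top retrieved documents and a unigram model of the collection. This is an assumption-laden simplification: both models are plain maximum likelihood estimates, whereas the published Clarity Score builds a smoothed query model weighted by document relevance. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def language_model(docs):
    """Maximum likelihood unigram model over a list of documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def clarity(top_docs, collection):
    """KL divergence (in bits) between the top-documents model and the collection model.

    Assumes the top documents are part of the collection, so every word has
    a non-zero collection probability.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[word]) for word, p in p_top.items())


# Hypothetical four-document collection.
collection = [
    "jaguar car speed",
    "jaguar cat jungle",
    "car engine speed",
    "cat jungle animal",
]

focused = clarity(["jaguar car speed", "car engine speed"], collection)
unfocused = clarity(collection, collection)
```

A topically cohesive result set diverges from the collection and gets a positive score, while a result set that looks like the whole collection scores zero.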

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy
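
A simple way to quantify agreement among ranked lists is the fraction of shared documents in their top k positions, averaged over all pairs of lists. The sketch below uses that set-overlap measure as a stand-in; it is an assumption and not the measure used by the cited authors, and the function names and rankings are hypothetical.

```python
from itertools import combinations


def topk_overlap(list_a, list_b, k=10):
    """Fraction of the top-k positions whose documents appear in both lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k


def mean_pairwise_overlap(rankings, k=10):
    """Average top-k overlap over every pair of ranked lists (needs at least two lists)."""
    pairs = list(combinations(rankings, 2))
    return sum(topk_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Hypothetical rankings produced by three ranking functions for two queries.
easy_query = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
hard_query = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d1", "d7", "d8"]]

easy_score = mean_pairwise_overlap(easy_query, k=3)
hard_score = mean_pairwise_overlap(hard_query, k=3)
```

The same `topk_overlap` function could compare the ranking of a full query with the ranking of one of its sub-queries, in the spirit of the first approach on this slide.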

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query difficulty prediction as a communication channel problem

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′

The overlap between L and L′ is used as the prediction score

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency

Queries with low-frequency terms are predicted to achieve a better performance

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
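
The SCS formula translates almost directly into code. In the sketch below, the maximum likelihood estimate of a term given the query is taken to be its count in the query divided by the query length; the function name, the argument layout, and the toy counts are assumptions, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter


def scs(query_terms, collection_counts, collection_size):
    """Simplified Clarity Score of a query, in bits.

    collection_counts maps each term to its number of occurrences in the collection;
    collection_size is the total number of term occurrences in the collection.
    """
    query_counts = Counter(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)                       # P_ml(k_i | q)
        p_coll = collection_counts[term] / collection_size   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection statistics.
counts = {"information": 64, "retrieval": 16}
score = scs(["information", "retrieval"], counts, collection_size=1024)
```

Here each term has probability 0.5 in the query, against 1/16 and 1/64 in the collection, so the score is 0.5 × 3 + 0.5 × 5 = 4 bits. Rarer query terms push the score up, which matches the prediction that queries with low-frequency terms perform better.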

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
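
A minimal sketch of Averaged PMI follows. For simplicity it estimates all probabilities at the document level, that is, as the fraction of documents containing a term or a pair of terms; this is an assumption that differs from the term-count definition of $P_{coll}(k_i)$ given for SCS. The function name and the toy collection are hypothetical, and the query is assumed to contain at least two distinct terms whose pairs all co-occur somewhere in the collection.

```python
import math
from itertools import combinations


def av_pmi(query_terms, docs):
    """Average pointwise mutual information (in bits) over all pairs of distinct query terms."""
    doc_sets = [set(doc.lower().split()) for doc in docs]

    def probability(*terms):
        """Fraction of documents that contain all the given terms."""
        return sum(all(t in words for t in terms) for words in doc_sets) / len(doc_sets)

    pairs = list(combinations(sorted(set(query_terms)), 2))
    return sum(
        math.log2(probability(a, b) / (probability(a) * probability(b)))
        for a, b in pairs
    ) / len(pairs)


# Hypothetical collection.
docs = [
    "information retrieval",
    "information retrieval systems",
    "query logs",
    "web search",
]

pmi = av_pmi(["information", "retrieval"], docs)
```

Both terms appear in half of the documents and always together, so the pair scores log2(0.5 / 0.25) = 1 bit; independent terms would score 0.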

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy than post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections

Each system allows specifying some types of patterns

The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20

Pattern MatchingThe most used types of patterns are

Words a string which must be a word in the text

Prefixes a string which must form the beginning of a text word

Suffixes a string which must form the termination of a text word

Substrings a string which can appear within a text word

Ranges a pair of strings which matches any word whichlexicographically lies between them

Allowing errors a word together with an error threshold

Regular expressions a rather general pattern built up by simplestrings

Extended patterns a more user-friendly query language torepresent some common cases of regular expressions

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21

Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred

In this case a query becomes simply an enumeration ofwords and context queries

The negation can be handled by letting the userexpress that some words are not desired

Then the documents containing them are penalized in the rankingcomputation

Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22

Query LanguagesStructural Queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23

Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24

Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed StructureThe structure allowed in texts was traditionally quiterestrictive

The documents had a fixed set of fields and each fieldhad some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

They were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26

Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power

Retrieval from hypertext began as a merely navigationalactivity

That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted

Some query tools allow querying hypertext based ontheir content and their structure

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28

Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure

An example of a hierarchical structure the page of abook and its schematic view

Chapter 6

We cover in this chapter the different kind of

61 Keyword Based

63 Structured Queries

chapter

section section

title title figure

We cover Keyword Based Structural

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29

Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary; logarithmic axes, x: document frequency, y: query frequency]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search

Many people refine the query by adding and removing words, but most of them see very few answer pages

The table below shows query statistics for four different search engines

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query)

Further studies have shown that the focus of queries has shifted from leisure to e-commerce (more details later in the section about Query Topics)

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. States: Home Page, Advanced search, Directory, Results, Selected, Refining, End session; other labels: URL, Web; edges are labeled with transition probabilities]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted at helping the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus, they replace Broder's transactional category with a broader category of resources

Query Intent

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be classified in more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                      1997   2001
1      Commerce, travel, employment or economy    13.3   24.7
2      People, places or things                    6.7   19.7
3      Non-English or unknown                      4.1   11.3
4      Computers or Internet                      12.5    9.6
5      Sex or pornography                         16.8    8.5
6      Health or sciences                          9.5    7.5
7      Entertainment or recreation                19.9    6.6
8      Education or humanities                     5.6    4.5
9      Society, culture, ethnicity or religion     5.4    3.9
10     Government                                  3.4    2.0
11     Performing or fine arts                     5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: F-score of about 0.45 on 63 categories

The table below shows the results for five queries

Query                  Top category              Second category
chat rooms             Computers/Internet        Online Community/Chat
lake michigan lodges   Info/Local & Regional     Living/Travel & Vacation
stephen hawking        Info/Science & Tech       Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides    Living/Pets & Animals
text mining            Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They classified the results of a query with the classifier and then used a voting algorithm to classify the query
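
The classify-then-vote idea can be sketched as follows. This is only an illustration under stated assumptions: the source does not describe the actual voting scheme, so a plain majority vote is used here, and `classify_document`, the keyword-based `toy_classifier` and the sample results are hypothetical.

```python
from collections import Counter

def classify_query_by_voting(result_docs, classify_document, top_k=1):
    """Label each retrieved document, then return the top_k most voted categories."""
    votes = Counter(classify_document(doc) for doc in result_docs)
    return [category for category, _ in votes.most_common(top_k)]

# Hypothetical document classifier: a stand-in for one trained on a taxonomy.
def toy_classifier(doc_text):
    return "Pets" if "dog" in doc_text else "Shopping"

# Hypothetical search results for a rare query.
results = ["dog shampoo reviews", "best dog grooming", "buy shampoo online"]
labels = classify_query_by_voting(results, toy_classifier)
```

Because the query itself is short, the vote over its results supplies the context that the query terms alone lack.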

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer

the user might have more than one goal in the session

Hence, it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: a sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are sequences of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
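
Splitting one user's queries by a maximum inactivity time can be sketched as below. The function name `split_sessions` and the 30-minute default are assumptions; the default is simply one value inside the five-to-sixty-minute range mentioned above.

```python
def split_sessions(timestamps, max_inactivity=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous query
    exceeds max_inactivity seconds.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

# Hypothetical timestamps: two queries, a long pause, two more queries.
sessions = split_sessions([0, 60, 4000, 4100])
```

This only yields time-based sessions; detecting missions would additionally require comparing the queries themselves.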

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution
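
One common way to quantify the difference between two language models is the Kullback-Leibler divergence, which is used in the sketch below. This is a simplified illustration, not the published Clarity Score: it uses unsmoothed maximum likelihood models, weights all top documents equally, and assumes the top documents are part of the collection so that every term has non-zero collection probability. All names are hypothetical.

```python
import math
from collections import Counter

def language_model(docs):
    """Unsmoothed unigram model over a list of tokenized documents."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def clarity_score(top_docs, collection_docs):
    """KL divergence (in bits) between the top-document model and the collection model."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection_docs)
    return sum(p * math.log2(p / p_coll[t]) for t, p in p_top.items())

# Hypothetical toy collection; the top results cover only one of its two topics.
collection = [["a", "b"], ["c", "d"]]
focused = clarity_score([["a", "b"]], collection)
unfocused = clarity_score(collection, collection)
```

A topically cohesive result set yields a higher score than one that looks like the collection as a whole, which matches the intuition stated above.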

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy
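
The overlap idea shared by both approaches can be illustrated with a simple set-overlap measure over the top-k documents of several ranked lists. This is a stand-in chosen for brevity, not the measure used by the cited authors; `mean_pairwise_overlap` and the sample rankings are assumptions.

```python
def mean_pairwise_overlap(ranked_lists, k=10):
    """Average fraction of shared documents among the top-k of every pair of lists."""
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = [(a, b) for i, a in enumerate(tops) for b in tops[i + 1:]]
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)

# Hypothetical rankings of document ids from three ranking functions
# (or from the full query and two of its sub-queries).
rankings = [[1, 2, 3], [1, 2, 4], [1, 5, 6]]
overlap = mean_pairwise_overlap(rankings, k=3)
```

A value near 1 suggests an easy query and a value near 0 a difficult one; at least two ranked lists are required.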

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′

The overlap between L and L′ is used as the prediction score

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity
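
The Q, L, Q′, L′ loop can be sketched as follows. How Q′ is generated from L is not specified above, so this example assumes the most frequent terms of the documents in L; the toy collection, `toy_search` and all parameter names are likewise hypothetical.

```python
from collections import Counter

def query_feedback_score(query, search, doc_terms, k=10, n_terms=2):
    """Overlap between ranking L for `query` and ranking L' for a query built from L."""
    first = search(query)[:k]                      # L
    if not first:
        return 0.0
    counts = Counter(t for doc in first for t in doc_terms[doc])
    new_query = [t for t, _ in counts.most_common(n_terms)]  # Q' (assumed construction)
    second = search(new_query)[:k]                 # L'
    return len(set(first) & set(second)) / len(first)

# Hypothetical toy collection and retrieval function.
DOCS = {
    "d1": ["zipf", "law", "query"],
    "d2": ["zipf", "law", "text"],
    "d3": ["session", "mission"],
}

def toy_search(query_terms):
    """Rank documents by the number of matching query terms, ties broken by id."""
    scores = {d: len(set(query_terms) & set(terms)) for d, terms in DOCS.items()}
    return sorted((d for d, s in scores.items() if s > 0), key=lambda d: (-scores[d], d))

score = query_feedback_score(["zipf"], toy_search, DOCS, k=2)
```

A high overlap means the ranking survives the round trip through the channel, which is read as a sign of an easy query.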

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency

Queries with low-frequency terms are predicted to achieve better performance
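
As an illustration of a frequency-based predictor, the sketch below averages the IDF of the query terms. The IDF variant log2(N / df) is an assumption (several variants exist), and the code assumes every query term occurs in at least one document.

```python
import math

def averaged_idf(query_terms, docs):
    """Mean of log2(N / df) over the query terms, with docs given as token lists."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    idfs = []
    for term in query_terms:
        df = sum(term in d for d in doc_sets)  # document frequency, assumed > 0
        idfs.append(math.log2(n / df))
    return sum(idfs) / len(idfs)

# Hypothetical toy collection of four tokenized documents.
docs = [["a", "b"], ["b"], ["c"], ["c"]]
avg_idf = averaged_idf(["a", "b"], docs)
```

Higher values correspond to rarer query terms, which the predictor treats as a sign of better expected performance.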

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
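
The SCS definition translates almost directly into code. In the sketch below, the function and argument names are hypothetical, the maximum likelihood estimate is taken as the term's count in the query divided by the query length, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, collection_term_counts):
    """SCS(q) = sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k))."""
    total_terms = sum(collection_term_counts.values())
    score = 0.0
    for term, count in Counter(query_terms).items():
        p_ml = count / len(query_terms)                         # P_ml(k_i | q)
        p_coll = collection_term_counts[term] / total_terms     # P_coll(k_i), assumed > 0
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Hypothetical collection statistics: term -> number of occurrences.
collection_counts = {"a": 1, "b": 1, "c": 2}
scs = simplified_clarity_score(["a", "b"], collection_counts)
```

Only collection term counts are needed, so the score can be computed before the query is executed.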

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
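
A small sketch of Averaged PMI follows. To keep the joint and single-term probabilities on the same scale, both are estimated here as fractions of documents; that choice, the function name and the toy data are assumptions. The code also assumes the query has at least two distinct terms and that every pair co-occurs in at least one document.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average of log2(P(a, b) / (P(a) * P(b))) over all pairs of distinct query terms."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)

# Hypothetical toy collection: "a" and "b" always appear together.
docs = [["a", "b"], ["a", "b"], ["c"], ["d"]]
av_pmi = averaged_pmi(["a", "b"], docs)
```

Positive values indicate that the query terms co-occur more often than independence would predict.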

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

Hence, neither technique is efficient enough to be useful for large collections

  • Queries: Languages & Properties
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Fixed Structure
  • Query Protocols
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms

Boolean Queries

A query syntax tree is naturally defined

Consider the example of a query syntax tree below

AND
├── translation
└── OR
    ├── syntax
    └── syntactic

It will retrieve all the documents which contain the word translation as well as either the word syntax or the word syntactic

Boolean Queries

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

e1 OR e2: the query selects all documents which satisfy e1 or e2

e1 AND e2: selects all documents which satisfy both e1 and e2

e1 BUT e2: selects all documents which satisfy e1 but not e2

NOT e2: the query selects all documents which do not contain e2
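
A query syntax tree with these operators can be evaluated recursively over sets of document identifiers. The sketch below is one possible encoding, assumed for illustration: trees are nested tuples, leaves are words, and `index` is a hypothetical inverted index mapping each word to the set of documents containing it.

```python
def evaluate(node, index, all_docs):
    """Return the set of document ids that satisfy a boolean query tree."""
    if isinstance(node, str):                 # leaf: a single word
        return index.get(node, set())
    op, *args = node
    results = [evaluate(arg, index, all_docs) for arg in args]
    if op == "OR":
        return results[0] | results[1]
    if op == "AND":
        return results[0] & results[1]
    if op == "BUT":                           # e1 BUT e2: e1 and not e2
        return results[0] - results[1]
    if op == "NOT":                           # complement within the collection
        return all_docs - results[0]
    raise ValueError("unknown operator: %s" % op)

# Hypothetical inverted index over documents 1..4.
index = {"translation": {1, 2, 3}, "syntax": {1, 4}, "syntactic": {2}}
all_docs = {1, 2, 3, 4}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
answer = evaluate(query, index, all_docs)
```

Note that NOT needs the whole collection to compute a complement, whereas BUT only needs the two operand sets.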

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided

A document either satisfies the boolean query or it does not

This is quite a limitation because it does not allow for partial matching between a document and a user query

To overcome this limitation, the condition for retrieval must be relaxed

For instance, a document which partially satisfies an AND condition might be retrieved

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection

Boolean Queries

A fuzzy-boolean set of operators has been proposed

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents

The documents are ranked higher when they have a larger number of elements in common with the query
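
The relaxed ranking described above can be approximated by counting how many query terms each document shares with the query. This is a minimal sketch of that idea only, not a specific fuzzy-boolean operator set from the literature; the function name and toy documents are assumptions.

```python
def rank_by_matches(query_terms, docs):
    """Rank documents by the number of query terms they contain (at least one)."""
    wanted = set(query_terms)
    scored = [(len(wanted & set(terms)), doc_id) for doc_id, terms in docs.items()]
    # Higher score first; ties are broken by document id for determinism.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

# Hypothetical toy collection: document id -> list of words.
docs = {
    "d1": ["syntax", "translation"],
    "d2": ["translation"],
    "d3": ["other"],
}
ranking = rank_by_matches(["translation", "syntax"], docs)
```

A strict AND would return only d1; the relaxed version also retrieves d2, ranked lower.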

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment

Those segments satisfying the pattern specifications are said to match the pattern

We can search for documents containing segments which match a given search pattern

Each system allows specifying some types of patterns

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate

Pattern Matching

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
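
Most of these pattern types can be illustrated on a small vocabulary with the standard library alone. In the sketch below, the function names, the `kind` labels and the use of Levenshtein edit distance for "allowing errors" are assumptions made for illustration.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match_words(words, kind, pattern, upper=None, max_errors=0):
    """Return the words matching `pattern` under the given pattern type."""
    tests = {
        "word": lambda w: w == pattern,
        "prefix": lambda w: w.startswith(pattern),
        "suffix": lambda w: w.endswith(pattern),
        "substring": lambda w: pattern in w,
        "range": lambda w: pattern <= w <= upper,
        "errors": lambda w: edit_distance(w, pattern) <= max_errors,
        "regex": lambda w: re.fullmatch(pattern, w) is not None,
    }
    return [w for w in words if tests[kind](w)]

# Hypothetical vocabulary.
vocabulary = ["retrieval", "retrieve", "information", "inform", "text"]
prefix_hits = match_words(vocabulary, "prefix", "retriev")
```

For example, `match_words(vocabulary, "errors", "retrieve", max_errors=2)` would also accept "retrieval", which is two edits away.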

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred

In this case, a query becomes simply an enumeration of words and context queries

The negation can be handled by letting the user express that some words are not desired

Then the documents containing them are penalized in the ranking computation

Under this scheme, we have completely eliminated any reference to boolean operations and entered the field of natural language queries

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them

The standardization of languages to represent structured texts has pushed forward in this direction

Mixing contents and structure in queries allows posing very powerful queries

Queries can be expressed using containment, proximity or other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Structural Queries

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

[Figure: schematic examples of structures a), b) and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive

The documents had a fixed set of fields, and each field had some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

Fields were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table

There are several proposals that extend SQL to allow full-text retrieval

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power

Retrieval from hypertext began as a merely navigational activity

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted

Some query tools allow querying hypertexts based on their content and their structure

Fixed Structure

An intermediate model which lies between fixed structure and hypertext is the hierarchical structure

An example of a hierarchical structure: the page of a book and its schematic view

[Figure: a book page ("Chapter 6", "We cover in this chapter the different kind of ...", "6.1 Keyword Based", "6.3 Structured Queries") next to its schematic view, a tree with a chapter node, section nodes, and title and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented:

IN
├── figure
└── WITH
    ├── section
    └── WITH
        ├── title
        └── "structural"

This parsed query returns the image below

[Figure: the image returned by the query]

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases

Because they are not intended for human use, we refer to them as protocols rather than languages

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena there are several proposals for query protocols

The main goal of these protocols is to provide disc interchangeability

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture

The language does not define any specific formatting or markup

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition

For example: ... where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs

People have different search needs at different times and in different contexts

A number of researchers have attempted to classify and tally the types of information needs

Characterizing Web Queries

Most web search engines record information about the queries

This information includes the query itself, the time and the IP address

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies



Boolean Queries

The operators most commonly used, given two basic queries or boolean sub-expressions e1 and e2, are:

e1 OR e2: selects all documents which satisfy e1 or e2

e1 AND e2: selects all documents which satisfy both e1 and e2

e1 BUT e2: selects all documents which satisfy e1 but not e2

NOT e2: selects all documents which do not contain e2
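The four operators map directly onto set operations over the documents matched by each sub-expression. The sketch below is only an illustration: the toy index, the document ids, and the function names are assumptions, not part of any particular system.

```python
# Toy inverted index: term -> set of document ids (illustrative data only).
INDEX = {
    "query": {1, 2, 4},
    "language": {2, 3},
    "ranking": {1, 3, 4},
}
ALL_DOCS = {1, 2, 3, 4, 5}


def docs(term):
    """Documents that satisfy a basic one-word query."""
    return INDEX.get(term, set())


def op_or(e1, e2):
    return e1 | e2


def op_and(e1, e2):
    return e1 & e2


def op_but(e1, e2):
    return e1 - e2


def op_not(e2, universe=ALL_DOCS):
    # NOT needs the whole collection as its universe.
    return universe - e2


if __name__ == "__main__":
    print(op_or(docs("query"), docs("language")))
    print(op_and(docs("query"), docs("language")))
    print(op_but(docs("query"), docs("language")))
    print(op_not(docs("language")))
```

Note that BUT only needs the two operand sets, whereas NOT has to be computed against the full collection.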

Boolean Queries

With classic boolean systems, no ranking of the retrieved documents is provided.

A document either satisfies the boolean query or it does not.

This is quite a limitation because it does not allow for partial matching between a document and a user query.

To overcome this limitation, the condition for retrieval must be relaxed.

For instance, a document which partially satisfies an AND condition might be retrieved.

The NOT operator is usually not used alone, as the complement of a set of documents is the rest of the document collection.

Boolean Queries

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents.

The documents are ranked higher when they have a larger number of elements in common with the query.
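A minimal sketch of the relaxed behavior, assuming the simplest possible scoring (the count of distinct query terms a document contains). The function name and the toy documents are hypothetical, and real fuzzy-boolean operators may weight matches differently.

```python
def relaxed_rank(query_terms, documents):
    """Rank documents by how many distinct query terms they contain."""
    terms = set(query_terms)
    scored = []
    for doc_id, text in documents.items():
        matches = len(terms & set(text.lower().split()))
        if matches > 0:  # a partial match is enough to be retrieved
            scored.append((doc_id, matches))
    # More matched terms first; ties broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


DOCS = {
    "d1": "ranking for a boolean query",
    "d2": "boolean retrieval",
    "d3": "structured text",
}

if __name__ == "__main__":
    print(relaxed_rank(["boolean", "query", "ranking"], DOCS))
```

A strict AND over the three terms would return only d1; the relaxed version also retrieves d2, ranked lower.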

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments which match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of allowed patterns, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
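Several of these pattern types can be illustrated over a small vocabulary. The sketch below is an assumption-laden example: the word list and function names are invented, and "allowing errors" is interpreted here as Levenshtein edit distance, which is one common choice of error model.

```python
import re

WORDS = ["retrieval", "retrieve", "query", "queries", "ranking", "text"]


def prefix_match(prefix, words):
    return [w for w in words if w.startswith(prefix)]


def suffix_match(suffix, words):
    return [w for w in words if w.endswith(suffix)]


def substring_match(sub, words):
    return [w for w in words if sub in w]


def range_match(low, high, words):
    # Lexicographic range, both ends inclusive.
    return [w for w in words if low <= w <= high]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def approx_match(word, max_errors, words):
    return [w for w in words if edit_distance(word, w) <= max_errors]


def regex_match(pattern, words):
    return [w for w in words if re.fullmatch(pattern, w)]


if __name__ == "__main__":
    print(prefix_match("retr", WORDS))
    print(approx_match("querry", 1, WORDS))
    print(regex_match(r"quer(y|ies)", WORDS))
```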

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

The negation can be handled by letting the user express that some words are not desired.

Then the documents containing them are penalized in the ranking computation.

Under this scheme we have completely eliminated any reference to boolean operations and entered into the field of natural language queries.

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details in Chapter 13 on Structured Text Retrieval.

Structural Queries

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

[Figure: schematic views of the three structures, panels a), b), and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) next to its schematic tree, in which a chapter node contains section nodes, which in turn contain title and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented:

[Figure: parsed query tree, IN(figure, WITH(section, WITH(title, "structural")))]

This parsed query returns the image below.

[Figure: the image returned by the query]
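As a rough sketch, assume the query in the figure reads as "figures IN sections WITH a title containing structural"; this is our reading of the diagram, not a confirmed syntax. The Node class, the helper names, and the toy tree below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node strictly below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def select(root, kind):
    return [n for n in [root, *descendants(root)] if n.kind == kind]


def with_(nodes, kind, word):
    """Keep nodes having a descendant of type `kind` whose text contains `word`."""
    return [
        n for n in nodes
        if any(d.kind == kind and word.lower() in d.text.lower()
               for d in descendants(n))
    ]


def in_(kind, containers):
    """Nodes of type `kind` located inside any of the given containers."""
    return [d for c in containers for d in descendants(c) if d.kind == kind]


BOOK = Node("chapter", children=[
    Node("section", children=[Node("title", "Keyword Based")]),
    Node("section", children=[Node("title", "Structural Queries"),
                              Node("figure", "fig-structure")]),
])

if __name__ == "__main__":
    sections = with_(select(BOOK, "section"), "title", "structural")
    print([f.text for f in in_("figure", sections)])
```

The query is evaluated inside out: the inner WITH filters sections by title, and the outer IN collects the figures contained in the surviving sections.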

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

    Select abstract from journal.papers
    where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild cards and repetition.

For example:

    where paper contains "retrieval"
    or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length
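These three statistics are straightforward to compute from a list of logged query strings. The sketch below assumes whitespace tokenization and uses an invented five-query log; the function name is illustrative.

```python
from collections import Counter


def basic_stats(queries):
    """Return query counts, term counts, length histogram, and mean length."""
    query_counts = Counter(queries)
    term_counts = Counter(term for q in queries for term in q.split())
    lengths = Counter(len(q.split()) for q in queries)
    mean_length = sum(len(q.split()) for q in queries) / len(queries)
    return query_counts, term_counts, lengths, mean_length


# Invented sample log, one query string per entry.
LOG = ["chat rooms", "dog shampoo", "chat rooms", "text mining tools", "weather"]

if __name__ == "__main__":
    query_counts, term_counts, lengths, mean_length = basic_stats(LOG)
    print(query_counts.most_common(1))
    print(term_counts.most_common(2))
    print(dict(lengths), mean_length)
```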

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo! UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (% of queries) for these last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
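One simple way to estimate α from observed frequencies is a least-squares line fit on the log-log rank/frequency plot, where the slope is -α. This is only a sketch: other estimators (for example maximum likelihood) exist and may behave better on real logs, and the function name is an assumption.

```python
import math


def zipf_alpha(frequencies):
    """Estimate the Zipf exponent as minus the slope of log(freq) vs log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


if __name__ == "__main__":
    # Synthetic frequencies that follow an exact power law with exponent 1.2.
    synthetic = [1000 / rank ** 1.2 for rank in range(1, 51)]
    print(round(zipf_alpha(synthetic), 3))
```

The input needs at least two distinct ranks; on real data the head and tail of the distribution often deviate from a straight line, so the fit is usually restricted to a middle range.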

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

Relative query frequency vs. relative document frequency of each word in a vocabulary:

[Figure: log-log scatter plot; x-axis: Document frequency (1e-06 to 1), y-axis: Query frequency (1e-07 to 1)]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine

[Figure: state diagram with the states Home Page, Advanced search, Directory, Results, Selected, Refining, and End session, plus the labels URL and Web; arrows carry transition probabilities, drawn in three classes: > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted to help the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997   2001
1      Commerce, travel, employment, or economy    13.3   24.7
2      People, places, or things                   6.7    19.7
3      Non-English or unknown                      4.1    11.3
4      Computers or Internet                       12.5   9.6
5      Sex or pornography                          16.8   8.5
6      Health or sciences                          9.5    7.5
7      Entertainment or recreation                 19.9   6.6
8      Education or humanities                     5.6    4.5
9      Society, culture, ethnicity, or religion    5.4    3.9
10     Government                                  3.4    2.0
11     Performing or fine arts                     5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
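The classify-then-vote idea can be sketched as follows. This is not the authors' system: the keyword-overlap document classifier is a deliberately trivial stand-in for a trained classifier, the two categories are invented, and plain majority voting is only one possible voting scheme.

```python
from collections import Counter

# Stand-in "trained" classifier: category -> indicative keywords (hypothetical).
KEYWORDS = {
    "Pets": {"dog", "cat", "grooming"},
    "Shopping": {"buy", "price", "store"},
}


def classify_document(text):
    """Assign the category sharing the most keywords with the text, if any."""
    words = set(text.lower().split())
    best = max(KEYWORDS, key=lambda cat: len(words & KEYWORDS[cat]))
    return best if words & KEYWORDS[best] else None


def classify_query(result_texts):
    """Classify a query by majority vote over the classes of its search results."""
    votes = Counter(
        cat for cat in map(classify_document, result_texts) if cat is not None
    )
    return votes.most_common(1)[0][0] if votes else None


# Invented snippets standing in for the search results of "dog shampoo".
RESULTS = [
    "dog grooming tips",
    "buy dog shampoo at the best price",
    "cat and dog care",
]

if __name__ == "__main__":
    print(classify_query(RESULTS))
```

Classifying the results rather than the query itself gives the classifier much more text to work with, which matters for short and rare queries.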

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed limits of time to define sessions, but this definition presents two problems:

the session might be longer than the limit

the user might have more than one goal in the session

Hence, it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
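Splitting one user's query stream on a maximum inactivity time can be sketched as below. The 30-minute default is an arbitrary value inside the five-to-sixty-minute range mentioned above, and the event format (minutes, query) is an assumption made for the example.

```python
def split_sessions(events, max_gap_minutes=30):
    """Group one user's time-ordered (minutes, query) events into sessions.

    A new session starts whenever the gap since the previous query
    exceeds max_gap_minutes.
    """
    sessions = []
    last_time = None
    for timestamp, query in events:
        if last_time is None or timestamp - last_time > max_gap_minutes:
            sessions.append([])
        sessions[-1].append(query)
        last_time = timestamp
    return sessions


# Invented events for a single user; timestamps are minutes since the first query.
EVENTS = [
    (0, "lake michigan lodges"),
    (4, "lake michigan hotels"),
    (50, "dog shampoo"),
    (55, "dog shampoo reviews"),
]

if __name__ == "__main__":
    for session in split_sessions(EVENTS):
        print(session)
```

With the 30-minute threshold the stream splits into two sessions; with a 60-minute threshold it stays a single session, even though it mixes two different goals, which is exactly why missions are treated separately.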

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms: run the query and analyze its answer set

Pre-retrieval mechanisms: evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
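One common way to quantify that difference is the Kullback-Leibler divergence between the two language models. The sketch below is a simplification under stated assumptions: both models are unsmoothed relative frequencies, all top documents are weighted equally, and the top documents are drawn from the collection so every term has a nonzero collection probability.

```python
import math
from collections import Counter


def language_model(texts):
    """Unsmoothed unigram model: relative frequency of each word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def clarity(top_docs, collection):
    """KL divergence (in bits) of the top-document model from the collection model."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[word]) for word, p in p_top.items())


# Invented four-document collection.
COLLECTION = [
    "jaguar car engine",
    "jaguar cat jungle",
    "stock market news",
    "weather forecast today",
]

if __name__ == "__main__":
    print(clarity(["jaguar car engine"], COLLECTION))   # cohesive result set
    print(clarity(COLLECTION, COLLECTION))              # looks like the collection
```

A result set that looks like the whole collection scores zero, while a topically focused one scores higher.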

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy.

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they either take into account:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low frequency terms are predicted to achieve a better performance.
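As an illustration of a frequency-based predictor, the sketch below averages the IDF of the query terms. It assumes the plain log2(N / df) form of IDF, which is one of several variants in use, and the document frequencies are invented.

```python
import math


def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean of log2(N / df) over the query terms that occur in the collection."""
    idfs = [
        math.log2(num_docs / doc_freq[term])
        for term in query_terms
        if doc_freq.get(term, 0) > 0
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0


# Invented document frequencies for a collection of 1024 documents.
NUM_DOCS = 1024
DOC_FREQ = {"the": 1024, "retrieval": 16, "clarity": 4}

if __name__ == "__main__":
    print(averaged_idf(["retrieval", "clarity"], DOC_FREQ, NUM_DOCS))
    print(averaged_idf(["the", "retrieval"], DOC_FREQ, NUM_DOCS))
```

The query made of rarer terms gets the higher score, matching the prediction that low frequency terms lead to better performance.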

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2\left(\frac{P_{ml}(k_i|q)}{P_{coll}(k_i)}\right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
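The SCS formula can be computed directly from term counts. In the sketch below, the maximum likelihood estimate is taken as the count of the term in the query divided by the query length; the collection counts are invented, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter


def scs(query_terms, coll_counts, coll_total):
    """Simplified Clarity Score of a query, in bits."""
    query_counts = Counter(query_terms)
    query_length = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / query_length                 # P_ml(k_i | q)
        p_coll = coll_counts[term] / coll_total     # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Invented collection statistics: 1024 term occurrences in total.
COLL_TOTAL = 1024
COLL_COUNTS = {"text": 64, "mining": 16}

if __name__ == "__main__":
    print(scs(["text", "mining"], COLL_COUNTS, COLL_TOTAL))
```

Here each term has probability 0.5 in the query, so the score is 0.5 × log2(8) + 0.5 × log2(32) = 4 bits.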

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2\left(\frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\,P_{coll}(k_\ell)}\right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
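A small sketch of the AvPMI computation follows. Two choices here are assumptions rather than part of the definition above: all probabilities, including the single-term ones, are estimated as fractions of documents, and term pairs that never co-occur are skipped because their logarithm is undefined.

```python
import math
from itertools import combinations


def av_pmi(query_terms, documents):
    """Average pointwise mutual information over pairs of distinct query terms."""
    docs = [set(text.lower().split()) for text in documents]
    num_docs = len(docs)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in docs) / num_docs

    pmis = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:  # assumption: pairs that never co-occur are skipped
            pmis.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(pmis) / len(pmis) if pmis else 0.0


# Invented four-document collection.
DOCS = ["text mining", "text mining tools", "text retrieval", "query logs"]

if __name__ == "__main__":
    print(av_pmi(["text", "mining"], DOCS))
```

For this toy collection the pair (text, mining) co-occurs in 2 of 4 documents, against 3/4 × 2/4 expected under independence, so the score is log2(4/3), a small positive value.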

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

Consequently, neither technique is efficient enough to be useful for large collections.

  • Queries Languages amp Properties
  • Query Languages
  • Query Languages
  • Query Languages
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Word Queries
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Pattern Matching
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Structural Queries
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Query Protocols
  • Query Protocols
  • Query Protocols
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms

Boolean QueriesWith classic boolean systems no ranking of theretrieved documents is provided

A document either satisfies the boolean query or it does not

This is quite a limitation because it does not allow forpartial matching between a document and a user query

To overcome this limitation the condition for retrievalmust be relaxed

For instance a document which partially satisfies an ANDcondition might be retrieved

The NOT operator is usually not used alone as thecomplement of a set of documents is the rest of thedocument collection

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 17

Boolean Queries

A fuzzy-boolean set of operators has been proposed.

The idea is that the meaning of AND and OR can be relaxed so that they retrieve more documents.

The documents are ranked higher when they have a larger number of elements in common with the query.
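
The relaxed AND described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `fuzzy_and_rank`, whitespace tokenization, and the toy documents are hypothetical, and the score is simply the number of distinct query terms a document contains.

```python
def fuzzy_and_rank(query_terms, documents):
    """Rank documents by how many distinct query terms they contain."""
    query = set(query_terms)
    scored = []
    for doc_id, text in documents.items():
        matched = query & set(text.lower().split())
        if matched:  # relaxed AND: a single matching term is enough to be retrieved
            scored.append((doc_id, len(matched)))
    # More elements in common ranks higher; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "boolean queries in text retrieval",
    "d2": "ranking in retrieval systems",
    "d3": "cooking with olive oil",
}
ranking = fuzzy_and_rank(["boolean", "retrieval", "queries"], docs)
```

A strict AND would return only `d1`; the relaxed version also returns `d2`, ranked lower because it shares a single term with the query.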

Query Languages: Beyond Keywords

Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments that match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of allowed patterns, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word which lexicographically lies between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
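
The pattern types listed above can be illustrated with a small Python sketch. All names here (`MATCHERS`, `match_words`, `edit_distance`) are hypothetical, the error threshold is interpreted as a maximum Levenshtein distance (one common choice, assumed here), and the text is tokenized naively into lowercase alphabetic words.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# One predicate per pattern type; w is a text word, p is the pattern.
MATCHERS = {
    "word": lambda w, p: w == p,
    "prefix": lambda w, p: w.startswith(p),
    "suffix": lambda w, p: w.endswith(p),
    "substring": lambda w, p: p in w,
    "range": lambda w, p: p[0] <= w <= p[1],                 # p = (low, high)
    "errors": lambda w, p: edit_distance(w, p[0]) <= p[1],   # p = (word, threshold)
    "regex": lambda w, p: re.fullmatch(p, w) is not None,
}


def match_words(text, kind, pattern):
    """Return the words of the text that match the pattern of the given kind."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if MATCHERS[kind](w, pattern)]


text = "Computers compute; a commuter computes routes."
prefix_hits = match_words(text, "prefix", "comput")
range_hits = match_words(text, "range", ("b", "d"))
error_hits = match_words(text, "errors", ("computer", 1))
```

Note how allowing one error on `computer` also matches `commuter`, which none of the exact pattern types centered on `comput` would retrieve.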

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

Then the documents containing them are penalized in the ranking computation.

Under this scheme we have completely eliminated any reference to boolean operations and entered the field of natural language queries.
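
A minimal Python sketch of this idea follows, assuming a plain term-count score and a fixed penalty per occurrence of an undesired word. The names `score` and `rank` and the `penalty` parameter are hypothetical; the slides do not prescribe a particular scoring formula.

```python
def score(text, wanted, unwanted, penalty=1.0):
    """Occurrences of wanted words add to the score, unwanted words subtract."""
    words = text.lower().split()
    reward = sum(words.count(term) for term in wanted)
    cost = penalty * sum(words.count(term) for term in unwanted)
    return reward - cost


def rank(documents, wanted, unwanted=()):
    """Return document ids ordered by decreasing score (ties by id)."""
    scores = {doc_id: score(text, wanted, unwanted) for doc_id, text in documents.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    "d1": "jaguar speed in the wild",
    "d2": "jaguar car top speed",
    "d3": "car dealers",
}
order = rank(docs, wanted=["jaguar", "speed"], unwanted=["car"])
```

Documents containing an undesired word are not excluded, as a boolean NOT would do; they are only pushed down the ranking.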

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details in Chapter 13, on Structured Text Retrieval.

Structural Queries

The three main types of structures:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page ("Chapter 6", with sections "6.1 Keyword Based" and "6.3 Structured Queries") next to its schematic view, a tree with a chapter node, two section nodes, their title nodes, and a figure node]

Fixed Structure

An example of a query to the hierarchical structure presented:

[Figure: parsed query tree with the nodes IN, figure, WITH, section, WITH, title, "structural"]

This parsed query returns the image below.

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example: ... where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length
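
As a rough illustration, the three basic statistics can be computed from a list of logged query strings as follows. The function name `basic_log_statistics`, the normalization (lowercasing and whitespace collapsing), and the sample log are assumptions made for this sketch; real logs also carry timestamps, IP addresses, and clicks.

```python
from collections import Counter


def basic_log_statistics(queries):
    """Query occurrences, term occurrences, and query-length statistics."""
    normalized = [" ".join(q.lower().split()) for q in queries]
    query_counts = Counter(normalized)
    term_counts = Counter(term for q in normalized for term in q.split())
    lengths = [len(q.split()) for q in normalized]
    length_distribution = Counter(lengths)
    mean_length = sum(lengths) / len(lengths)
    return query_counts, term_counts, length_distribution, mean_length


log = ["weather", "Weather", "cheap flights", "cheap  hotels rome"]
query_counts, term_counts, length_distribution, mean_length = basic_log_statistics(log)
```

The length distribution is the same kind of quantity reported per search engine in the query-length tables of log studies.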

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo! UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for the last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18%              25%
2        32%              35%
3        25%              23%
4        13%              11%
5         6%               4%
6         3%               1%
>7        3%               1%

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in the Web pages and in the queries also varies.

These values range from 0.15 to 0.42.
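
One simple way to estimate α from term counts is a least-squares line fit on the log-log rank/frequency plot, taking Zipf's law in its usual form, frequency proportional to rank^(-α). The sketch below is only an illustration: the function name `zipf_exponent` is hypothetical, the counts are synthetic, and the slides do not say which estimation method the cited studies used.

```python
import math


def zipf_exponent(term_counts):
    """Estimate alpha by fitting log(frequency) = c - alpha * log(rank)."""
    freqs = sorted(term_counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope  # the fitted line has slope -alpha


# Synthetic counts that follow an exact power law with alpha = 1.2.
counts = {f"t{rank}": 1000.0 / rank ** 1.2 for rank in range(1, 51)}
alpha = zipf_exponent(counts)
```

On real query logs the fit is noisier, and the estimate depends on how many of the rarest terms are included.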

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

Relative query frequency vs. relative document frequency of each word in a vocabulary:

[Figure: scatter plot on logarithmic axes; x-axis: document frequency (1e-06 to 1), y-axis: query frequency (1e-07 to 1)]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine:

[Figure: state-transition diagram with the states Home Page, Advanced search, Directory, Results, Selected, Refining, and End session; edges are labeled with transition probabilities and drawn in three classes: x > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%):

Rank   Topic                                       1997   2001
1      Commerce, travel, employment, or economy    13.3   24.7
2      People, places, or things                    6.7   19.7
3      Non-English or unknown                       4.1   11.3
4      Computers or Internet                       12.5    9.6
5      Sex or pornography                          16.8    8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                 19.9    6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed limits of time to define sessions, but this definition presents two problems:

the session might be longer than the fixed time limit

the user might have more than one goal in the session

Hence it is good to distinguish:

time-based sessions: the queries of a user in the same session

missions: a sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are a sequence of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
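
Splitting one user's queries into sessions with a maximum inactivity time can be sketched as below. The function name `split_sessions` and the 30-minute default are assumptions chosen for illustration; the default simply falls inside the five-to-sixty-minute range mentioned above.

```python
from datetime import datetime, timedelta


def split_sessions(events, max_inactivity=timedelta(minutes=30)):
    """Group (timestamp, query) pairs of one user into time-based sessions."""
    sessions = []
    for timestamp, query in sorted(events):
        last_time = sessions[-1][-1][0] if sessions else None
        if last_time is not None and timestamp - last_time <= max_inactivity:
            sessions[-1].append((timestamp, query))    # still the same session
        else:
            sessions.append([(timestamp, query)])      # inactivity gap: new session
    return sessions


events = [
    (datetime(2010, 1, 1, 10, 0), "zipf law"),
    (datetime(2010, 1, 1, 10, 10), "zipf law exponent"),
    (datetime(2010, 1, 1, 11, 0), "cheap flights"),
    (datetime(2010, 1, 1, 11, 5), "cheap flights rome"),
]
sessions = split_sessions(events)
```

This only finds time-based sessions. Detecting missions additionally requires deciding whether consecutive queries express the same need, which a time threshold alone cannot do.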

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
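
The intuition can be made concrete with a simplified Python sketch that measures the Kullback-Leibler divergence between the term distribution of the top retrieved documents and that of the collection. This is a simplification under stated assumptions: both models are plain maximum-likelihood unigram estimates without smoothing or query-likelihood weighting, the top documents must belong to the collection, and the names `language_model` and `clarity_score` are hypothetical.

```python
import math
from collections import Counter


def language_model(texts):
    """Maximum-likelihood unigram model over whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def clarity_score(top_docs, collection):
    """KL divergence (in bits) of the top-document model from the collection model."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[word]) for word, p in p_top.items())


collection = [
    "jaguar car speed",
    "jaguar cat jungle",
    "stock market news",
    "football match news",
]
focused = clarity_score(collection[:2], collection)  # cohesive answer set
broad = clarity_score(collection, collection)        # answer set equals the collection
```

A cohesive answer set diverges from the collection model and gets a higher score, while an answer set that looks like the whole collection scores zero.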

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy.
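
The overlap idea can be sketched as the mean pairwise overlap of the top-k documents across several ranked lists. This is an illustrative stand-in, not the measure used in the cited papers: the function name `mean_topk_overlap`, the cutoff `k`, and the assumption that every list has at least `k` entries are all choices made for this example.

```python
from itertools import combinations


def mean_topk_overlap(ranked_lists, k=10):
    """Average fraction of shared documents among the top k of each pair of lists."""
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)


# Top-3 results for the same query under three hypothetical ranking functions.
rankings = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "c", "b"],
]
agreement = mean_topk_overlap(rankings, k=3)
```

A value close to 1 means the ranking functions agree, which would predict an easy query; a value close to 0 would predict a difficult one. The same function can compare the list of the full query against the lists of its sub-queries.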

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve a better performance.
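
An Averaged IDF predictor can be sketched as follows. The slides do not give the exact IDF formula, so the add-one smoothed variant log2((N + 1) / (df + 1)) used here is an assumption, as are the function name `averaged_idf` and the toy collection.

```python
import math


def averaged_idf(query_terms, documents):
    """Mean inverse document frequency of the query terms (smoothed variant)."""
    n_docs = len(documents)
    doc_words = [set(doc.lower().split()) for doc in documents]
    idfs = []
    for term in query_terms:
        df = sum(term in words for words in doc_words)  # document frequency
        idfs.append(math.log2((n_docs + 1) / (df + 1)))
    return sum(idfs) / len(idfs)


collection = [
    "the zipf law of the web",
    "the web query log",
    "the search engine",
    "the boolean query",
]
rare = averaged_idf(["zipf", "law"], collection)
common = averaged_idf(["the", "query"], collection)
```

The query made of rare terms gets the higher value, matching the prediction that queries with low-frequency terms perform better.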

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
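
The SCS formula translates directly into code. In this sketch the maximum likelihood estimate of a term given the query is taken as its count in the query divided by the query length, and every query term is assumed to occur in the collection; the function name `simplified_clarity_score` and the toy counts are hypothetical.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_counts):
    """SCS(q): sum over query terms of P_ml * log2(P_ml / P_coll)."""
    total_terms = sum(collection_counts.values())
    score = 0.0
    for term, count in Counter(query_terms).items():
        p_ml = count / len(query_terms)                  # P_ml(k_i | q)
        p_coll = collection_counts[term] / total_terms   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Term counts of a toy collection with 64 term occurrences in total.
collection_counts = Counter({"the": 50, "retrieval": 10, "zipf": 4})
specific = simplified_clarity_score(["zipf"], collection_counts)
vague = simplified_clarity_score(["the", "zipf"], collection_counts)
```

For the one-term query `zipf`, P_ml is 1 and P_coll is 4/64, so the score is log2(16) = 4; adding the very frequent term `the` lowers the score.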

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
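
A small Python sketch of Averaged PMI follows. To keep the joint and marginal probabilities comparable, this example estimates all of them at the document level (the fraction of documents containing the term or the pair), which is an assumption of the sketch. It also assumes at least two distinct query terms and that every pair co-occurs in at least one document; the function name `averaged_pmi` is hypothetical.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, documents):
    """Mean pointwise mutual information over all pairs of distinct query terms."""
    doc_words = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_words)

    def probability(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in words for t in terms) for words in doc_words) / n_docs

    pairs = list(combinations(sorted(set(query_terms)), 2))
    total = sum(
        math.log2(probability(a, b) / (probability(a) * probability(b)))
        for a, b in pairs
    )
    return total / len(pairs)


collection = ["information retrieval", "information retrieval", "cooking", "travel"]
pmi = averaged_pmi(["information", "retrieval"], collection)
```

Here each term appears in half of the documents and the pair co-occurs in half of them, so the PMI is log2(0.5 / 0.25) = 1; independent terms would give 0.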

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections.

Boolean QueriesA fuzzy-boolean set of operators has been proposed

The idea is that the meaning of AND and OR can berelaxed so that they retrieve more documents

The documents are ranked higher when they have alarger number of elements in common with the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 18

Query LanguagesBeyond Keywords

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 19

Pattern MatchingA pattern is a set of syntactic features that must befound in a text segment

Those segments satisfying the pattern specificationsare said to match the pattern

We can search for documents containing segmentswhich match a given search pattern

Each system allows specifying some types of patterns

The more powerful the set of patterns allowed the moreinvolved queries can the user formulate in general

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 20

Pattern MatchingThe most used types of patterns are

Words a string which must be a word in the text

Prefixes a string which must form the beginning of a text word

Suffixes a string which must form the termination of a text word

Substrings a string which can appear within a text word

Ranges a pair of strings which matches any word whichlexicographically lies between them

Allowing errors a word together with an error threshold

Regular expressions a rather general pattern built up by simplestrings

Extended patterns a more user-friendly query language torepresent some common cases of regular expressions

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21

Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred

In this case a query becomes simply an enumeration ofwords and context queries

The negation can be handled by letting the userexpress that some words are not desired

Then the documents containing them are penalized in the rankingcomputation

Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22

Query LanguagesStructural Queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23

Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24

Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed StructureThe structure allowed in texts was traditionally quiterestrictive

The documents had a fixed set of fields and each fieldhad some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

They were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26

Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power

Retrieval from hypertext began as a merely navigationalactivity

That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted

Some query tools allow querying hypertext based ontheir content and their structure

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28

Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure

An example of a hierarchical structure the page of abook and its schematic view

Chapter 6

We cover in this chapter the different kind of

61 Keyword Based

63 Structured Queries

chapter

section section

title title figure

We cover Keyword Based Structural

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29

Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries, as words in a text, follow a biased distribution

In fact, the frequency of query words follows a Zipf's law with parameter α

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences

However, this is less biased than Web text, where α is closer to 2

The standard correlation between the frequency of a word in Web pages and in queries also varies

These values range from 0.15 to 0.42
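As a rough illustration of how the Zipf parameter α could be estimated from a query log, the sketch below fits a straight line to log(frequency) against log(rank) by ordinary least squares and reports the negated slope. The function name `zipf_alpha` and the choice of a plain least-squares fit are assumptions made for this example; the studies cited above may have used other estimators.

```python
import math

def zipf_alpha(frequencies):
    """Estimate the Zipf exponent from a list of term frequencies.

    Assumes frequency ~ C / rank**alpha, so log(frequency) is linear in
    log(rank) with slope -alpha. Requires at least two positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the fitted slope is -alpha
```

In practice the frequencies would come from counting the words of all logged queries, for example with `collections.Counter`.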

This implies that what people search for is different from what people publish on the Web

[Figure: relative query frequency (vertical axis) vs. relative document frequency (horizontal axis) of each word in a vocabulary; both axes are logarithmic]

Some logs also register the number of answer pages seen and the pages selected after a search

Many people refine the query by adding and removing words, but most of them see very few answer pages

The table below shows query statistics for four different search engines

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query)

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics)

Query Properties
User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels: Home Page, Advanced search, Directory, Results, Selected, Refining, End session, URL, Web. Edges are annotated with transition probabilities; the legend groups them into >0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted at helping the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus they replace Broder's transactional category with a broader category of resources

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be classified in more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Spink & Jansen et al. have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                        1997   2001
1      Commerce, travel, employment, or economy     13.3   24.7
2      People, places, or things                    6.7    19.7
3      Non-English or unknown                       4.1    11.3
4      Computers or Internet                        12.5   9.6
5      Sex or pornography                           16.8   8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                  19.9   6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: F-score of about 0.45 on 63 categories

The table below shows the results for five queries

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They classified the results of a query with the classifier and then used a voting algorithm to classify the query

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the limit

the user might have more than one goal in the session

Hence it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are sequences of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
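The maximum-inactivity rule for session boundaries can be sketched as follows. This is a minimal illustration, not a method taken from any of the cited studies: the function name `split_sessions`, the `(timestamp, query)` tuple layout, and the 30-minute default (one value inside the five-to-sixty-minute range mentioned above) are all assumptions.

```python
from datetime import datetime, timedelta

def split_sessions(events, max_inactivity=timedelta(minutes=30)):
    """Split one user's query log into sessions.

    events: list of (timestamp, query) tuples, sorted by timestamp.
    A new session starts whenever the gap since the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for timestamp, query in events:
        if sessions and timestamp - sessions[-1][-1][0] <= max_inactivity:
            sessions[-1].append((timestamp, query))    # same session continues
        else:
            sessions.append([(timestamp, query)])      # gap too long: new session
    return sessions
```

Note that this only yields time-based sessions; grouping queries into missions would additionally require comparing the queries themselves.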

Query Properties
Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution
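The idea behind Clarity Score can be illustrated with a simplified sketch that measures the Kullback-Leibler divergence (in bits) between a unigram model of the top retrieved documents and a unigram model of the whole collection. The simplifications are assumptions of this example: both models are plain, unsmoothed maximum likelihood estimates, and the top documents are weighted equally, whereas the published score estimates a query language model from the top documents. The name `clarity_score` is hypothetical.

```python
import math
from collections import Counter

def clarity_score(top_docs, collection):
    """KL divergence between the top-document and collection term models.

    top_docs, collection: lists of documents, each a list of tokens.
    Assumes every document in top_docs also belongs to collection, so
    that every term seen in top_docs has a non-zero collection probability.
    """
    top_counts = Counter(t for doc in top_docs for t in doc)
    coll_counts = Counter(t for doc in collection for t in doc)
    top_total = sum(top_counts.values())
    coll_total = sum(coll_counts.values())
    score = 0.0
    for term, count in top_counts.items():
        p_top = count / top_total
        p_coll = coll_counts[term] / coll_total
        score += p_top * math.log2(p_top / p_coll)
    return score
```

A score near zero means the top results look like the collection as a whole (an ambiguous query under this intuition), while a larger score indicates topically cohesive results.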

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Query Feedback frames query prediction as a communication channel problem

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′

The overlap between L and L′ is used as the prediction score

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity
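The overlap between two rankings, as used by Query Feedback to compare L and L′, could be computed along the following lines. The cutoff `k`, the normalization by `k`, and the name `overlap_at_k` are assumptions of this sketch; how Q′ is generated from L is not covered here.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Fraction of the top-k positions shared by two ranked lists.

    ranking_a, ranking_b: lists of document identifiers, best first.
    Returns a value in [0, 1]; k must be positive.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k
```

A high overlap suggests that the retrieval system behaves like a low-noise channel for this query, which is taken as a sign of an easy query.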

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they take into account either

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency

Queries with low-frequency terms are predicted to achieve a better performance

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2\left(\frac{P_{ml}(k_i|q)}{P_{coll}(k_i)}\right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
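The SCS formula translates almost directly into code. In this sketch the maximum likelihood estimate of a term given the query is taken to be its count in the query divided by the query length, which is an assumption about a detail the text leaves open; the function and parameter names are hypothetical, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, coll_term_counts, coll_total_terms):
    """Simplified Clarity Score of a query.

    query_terms: list of tokens in the query.
    coll_term_counts: dict mapping a term to its count in the collection.
    coll_total_terms: total number of term occurrences in the collection.
    """
    query_counts = Counter(query_terms)
    query_length = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / query_length                        # P_ml(k_i | q)
        p_coll = coll_term_counts[term] / coll_total_terms  # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score
```

Because it needs only collection term counts, the score can be computed without running the query.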

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2\left(\frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)}\right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
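A small sketch of Averaged PMI follows. Two choices here are assumptions of the example: all probabilities, including the single-term ones, are estimated at the document level (fraction of documents containing the term or pair), and pairs that never co-occur are skipped instead of contributing minus infinity. The name `averaged_pmi` is hypothetical.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over all pairs of query terms.

    docs: non-empty list of documents, each given as a set of terms.
    """
    num_docs = len(docs)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(1 for doc in docs if all(t in doc for t in terms)) / num_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:  # skip pairs that never co-occur (assumption)
            values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0
```

Values above zero indicate that the query terms co-occur more often than independence would predict.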

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

However, neither technique is efficient enough to be useful for large collections



Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions

Early work used fixed limits of time to define sessionsbut this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence it is good to distinguish

time based sessions queries of a user in a same session

missions sequence of queries with the same goal

Research missions an additional problem is thatmissions can span more than one session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56

Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time

Different authors have found different thresholds ranging from fiveto sixty minutes

Query missions are a sequence of reformulatedqueries that express a same need

The detection of missions is a hard problem and hasbeen approached by several researchers

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57

Query PropertiesQuery Difficulty

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58

Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty

For example one word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval predictions mechanisms run the query andanalyze its answer set

Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language model of the collection and that of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of the results of an ambiguous query is assumed to be more similar to the collection distribution.
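The idea behind Clarity Score can be illustrated with a simplified sketch. The assumptions here are mine: both language models are plain maximum-likelihood unigram models, the top documents are drawn from the collection (so every term has a non-zero collection probability), and the difference is measured as a KL divergence in bits. The published method estimates the query model in a more refined way, so this shows the intuition rather than the exact formula.

```python
import math
from collections import Counter

def unigram_model(docs):
    """Maximum-likelihood unigram model over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def clarity(top_docs, collection):
    """KL divergence (base 2) of the top-documents model from the collection model.

    Assumes top_docs is a subset of collection, so p_coll[t] always exists.
    """
    p_top = unigram_model(top_docs)
    p_coll = unigram_model(collection)
    return sum(p * math.log2(p / p_coll[t]) for t, p in p_top.items())
```

A score near zero means the top results look like the collection as a whole, which is the situation described above for ambiguous queries.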

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
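The overlap intuition can be sketched as follows. The set-overlap measure, the cutoff `k`, and both function names are illustrative assumptions, not the measure used by the cited authors; the ranked lists are assumed to be lists of document identifiers produced by different ranking functions for the same query.

```python
from itertools import combinations

def topk_overlap(list_a, list_b, k=10):
    """Fraction of documents shared by the top k of two ranked lists."""
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    return len(top_a & top_b) / k

def mean_pairwise_overlap(ranked_lists, k=10):
    """Average top-k overlap over all pairs of ranked lists (needs at least two lists)."""
    pairs = list(combinations(ranked_lists, 2))
    return sum(topk_overlap(a, b, k) for a, b in pairs) / len(pairs)
```

Under this sketch, a value close to 1 would suggest an easy query and a value close to 0 a difficult one.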

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames query performance prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
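A frequency-based predictor such as Averaged IDF can be sketched as below. The exact IDF variant is an assumption (here log2(N / df)), as are the function name and the choice to skip query terms that do not occur in the collection.

```python
import math

def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean of log2(N / df) over the query terms that occur in the collection.

    doc_freq maps each term to the number of documents containing it.
    """
    idfs = [math.log2(num_docs / doc_freq[t])
            for t in query_terms if doc_freq.get(t, 0) > 0]
    return sum(idfs) / len(idfs) if idfs else 0.0
```

A higher value means the query is made of rarer terms, which, following the statement above, would be predicted to perform better.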

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

\[
SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)
\]

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
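The SCS formula translates almost directly into code. In this sketch the maximum likelihood estimate of a term given the query is assumed to be its relative frequency within the query, every query term is assumed to occur in the collection, and the names `coll_counts` and `coll_total` are hypothetical.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, coll_counts, coll_total):
    """SCS(q): sum over distinct query terms of P_ml * log2(P_ml / P_coll)."""
    q_counts = Counter(query_terms)
    q_len = len(query_terms)
    score = 0.0
    for term, count in q_counts.items():
        p_ml = count / q_len                      # assumed ML estimate: frequency in the query
        p_coll = coll_counts[term] / coll_total   # term count over total terms in the collection
        score += p_ml * math.log2(p_ml / p_coll)
    return score
```

Because it only needs term counts, the score can be computed before any retrieval takes place.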

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of pairs of query terms in the collection:

\[
AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) P_{coll}(k_\ell)} \right)
\]

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
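A small sketch of Averaged PMI follows. Two choices are assumptions made for illustration: all probabilities are estimated at the document level (the fraction of documents containing the term or the pair), and pairs that never co-occur are skipped to avoid taking the logarithm of zero.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average PMI over pairs of distinct query terms.

    docs is a list of sets of terms; probabilities are document-level
    relative frequencies (an assumption of this sketch).
    """
    n = len(docs)

    def p(*terms):
        return sum(1 for d in docs if all(t in d for t in terms)) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    values = [math.log2(p(a, b) / (p(a) * p(b))) for a, b in pairs if p(a, b) > 0]
    return sum(values) / len(values) if values else 0.0
```

Terms that tend to appear together yield a positive value, while statistically independent terms yield a value near zero.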

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much sparser.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

However, neither technique is efficient enough to be useful for large collections.


Pattern Matching

A pattern is a set of syntactic features that must be found in a text segment.

Those segments satisfying the pattern specifications are said to match the pattern.

We can search for documents containing segments that match a given search pattern.

Each system allows specifying some types of patterns.

In general, the more powerful the set of patterns allowed, the more involved the queries the user can formulate.

Pattern Matching

The most used types of patterns are:

Words: a string which must be a word in the text

Prefixes: a string which must form the beginning of a text word

Suffixes: a string which must form the termination of a text word

Substrings: a string which can appear within a text word

Ranges: a pair of strings which matches any word that lies lexicographically between them

Allowing errors: a word together with an error threshold

Regular expressions: a rather general pattern built up from simple strings

Extended patterns: a more user-friendly query language to represent some common cases of regular expressions
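Most of the pattern types listed above can be illustrated with a small matcher over the words of a text. Everything here is a sketch under stated assumptions: the names `MATCHERS` and `match_words` are hypothetical, text is lowercased and split into words with a simple regular expression, ranges are treated as inclusive, and the error measure is assumed to be the Levenshtein edit distance.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance, used here as the assumed error measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

# One predicate per pattern type; w is a text word, p is the pattern.
MATCHERS = {
    "word": lambda w, p: w == p,
    "prefix": lambda w, p: w.startswith(p),
    "suffix": lambda w, p: w.endswith(p),
    "substring": lambda w, p: p in w,
    "range": lambda w, p: p[0] <= w <= p[1],                # p = (low, high)
    "errors": lambda w, p: edit_distance(w, p[0]) <= p[1],  # p = (word, threshold)
    "regex": lambda w, p: re.fullmatch(p, w) is not None,
}

def match_words(text, kind, pattern):
    """Return the words of text that match the pattern of the given kind."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if MATCHERS[kind](w, pattern)]
```

For instance, `match_words(text, "prefix", "comput")` would select words such as "computer" and "computes", while `match_words(text, "errors", ("flower", 1))` would accept words within one edit of "flower".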

Natural Language

Pushing the fuzzy model even further, the distinction between AND and OR could be completely blurred.

In this case, a query becomes simply an enumeration of words and context queries.

Negation can be handled by letting the user express that some words are not desired.

The documents containing them are then penalized in the ranking computation.

Under this scheme, we have completely eliminated any reference to Boolean operations and entered the field of natural language queries.

Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing contents and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details are given in Chapter 13, on Structured Text Retrieval.

Structural Queries

The three main types of structures are:

a) form-like fixed structure

b) hypertext structure

c) hierarchical structure

[Figure: panels a), b), and c) illustrate the three structure types]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

The retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, with each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what they wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page ("Chapter 6", the text "We cover in this chapter the different kind of ...", and the sections "6.1 Keyword Based" and "6.3 Structured Queries") next to its schematic view, a tree with a chapter node, section nodes, and title and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented:

[Figure: parsed query tree with the nodes IN, figure, WITH, section, WITH, title, and "structural"]

This parsed query returns the image below.

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are (or were):

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example: where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length
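The three basic statistics can be computed from a list of logged query strings as sketched below. The normalization (lowercasing and collapsing whitespace) and the function name `basic_query_stats` are assumptions made for illustration; real logs would also carry the time, IP address, and click fields mentioned above.

```python
from collections import Counter

def basic_query_stats(queries):
    """Return query counts, term counts, query-length counts, and mean length."""
    normalized = [" ".join(q.lower().split()) for q in queries]
    query_counts = Counter(normalized)
    term_counts = Counter(t for q in normalized for t in q.split())
    length_counts = Counter(len(q.split()) for q in normalized)
    mean_length = sum(n * c for n, c in length_counts.items()) / len(normalized)
    return query_counts, term_counts, length_counts, mean_length
```

The length counts give the kind of query-length distribution reported in the studies that follow.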

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for the last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18%              25%
2        32%              35%
3        25%              23%
4        13%              11%
5        6%               4%
6        3%               1%
>7       3%               1%

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
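One simple way to estimate the Zipf parameter α from observed word frequencies is a least-squares fit of log frequency against log rank. This is only a sketch of one possible estimator (the function name `zipf_alpha` is hypothetical, and at least two positive frequencies are assumed); the cited studies may have fitted the parameter differently.

```python
import math

def zipf_alpha(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope of ys on xs.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```

Applied to query-term counts and to document-term counts separately, the two estimates could be compared in the way the slide describes.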

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary, on logarithmic axes; x-axis: document frequency, y-axis: query frequency]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later, in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels: Home Page, Advanced search, Directory, Results, Selected, Refining, End session, URL, Web. Edges carry transition probabilities, grouped in the legend as > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replace Broder's transactions category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997   2001
1      Commerce, travel, employment, or economy    13.3   24.7
2      People, places, or things                    6.7   19.7
3      Non-English or unknown                       4.1   11.3
4      Computers or Internet                       12.5    9.6
5      Sex or pornography                          16.8    8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                 19.9    6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
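The voting step can be illustrated with a plain majority vote over the categories predicted for the retrieved results. This is a simplification: the function name `classify_by_voting` and the unweighted vote are assumptions, the per-result categories are assumed to come from a separately trained classifier, and the actual voting scheme of the cited method may weight the results differently.

```python
from collections import Counter

def classify_by_voting(result_categories, top_n=1):
    """Return the top_n most voted categories.

    result_categories holds the predicted category of each retrieved result.
    """
    votes = Counter(result_categories)
    return [category for category, _ in votes.most_common(top_n)]
```

The appeal of this design is that a short, rare query borrows evidence from the longer documents it retrieves, which are easier to classify than the query itself.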



Pattern MatchingThe most used types of patterns are

Words a string which must be a word in the text

Prefixes a string which must form the beginning of a text word

Suffixes a string which must form the termination of a text word

Substrings a string which can appear within a text word

Ranges a pair of strings which matches any word whichlexicographically lies between them

Allowing errors a word together with an error threshold

Regular expressions a rather general pattern built up by simplestrings

Extended patterns a more user-friendly query language torepresent some common cases of regular expressions

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 21

Natural LanguagePushing the fuzzy model even further the distinctionbetween AND and OR could be completely blurred

In this case a query becomes simply an enumeration ofwords and context queries

The negation can be handled by letting the userexpress that some words are not desired

Then the documents containing them are penalized in the rankingcomputation

Under this scheme we have completely eliminated anyreference to boolean operations and entered into thefield of natural language queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 22

Query LanguagesStructural Queries

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 23

Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24

Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed StructureThe structure allowed in texts was traditionally quiterestrictive

The documents had a fixed set of fields and each fieldhad some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

They were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26

Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links, to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) shown next to its schematic view, a tree with a chapter node, two section nodes, their titles, and a figure.]

Fixed Structure

An example of a query to the hierarchical structure presented:

        IN
       /  \
  figure   WITH
          /    \
    section     WITH
               /    \
          title      "structural"

This parsed query returns the figure of the page shown before (image not reproduced here).
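A minimal sketch of how such a query could be evaluated over a document tree. The semantics assumed here (IN means "contained in", WITH means "containing") are one plausible reading of the example, and the `Node` class, the sample tree, and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: List["Node"] = field(default_factory=list)

def descendants(node):
    """Yield every node strictly below `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)

def figures_in_sections_with_title(root, word):
    """Evaluate: figure IN (section WITH (title WITH word)).

    Assumes sections do not nest, so no figure is reported twice.
    """
    results = []
    for node in [root, *descendants(root)]:
        if node.kind != "section":
            continue
        below = list(descendants(node))
        has_title = any(n.kind == "title" and word.lower() in n.text.lower()
                        for n in below)
        if has_title:
            results.extend(n for n in below if n.kind == "figure")
    return results

# Hypothetical tree loosely modeled on the book page in the example.
book = Node("chapter", children=[
    Node("section", children=[Node("title", "Keyword Based")]),
    Node("section", children=[
        Node("title", "Structural Queries"),
        Node("figure", "three-structures"),
    ]),
])
matches = figures_in_sections_with_title(book, "structural")
```

Only the figure inside the section whose title contains "structural" is selected; the first section is skipped because its title does not match.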

Query Languages

Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

... where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo! UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for the last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18%              25%
2        32%              35%
3        25%              23%
4        13%              11%
5        6%               4%
6        3%               1%
>7       3%               1%
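Basic statistics such as the mean query length and the length distribution can be computed directly from a query log. The sketch below assumes the log is already available as a list of query strings and that terms are separated by whitespace; the sample queries and the `length_stats` helper are hypothetical.

```python
from collections import Counter

def length_stats(queries):
    """Return (mean length in terms, {length: percentage of queries})."""
    lengths = [len(q.split()) for q in queries if q.strip()]
    if not lengths:
        return 0.0, {}
    mean = sum(lengths) / len(lengths)
    counts = Counter(lengths)
    dist = {n: 100.0 * c / len(lengths) for n, c in sorted(counts.items())}
    return mean, dist

# Hypothetical four-query log.
sample = ["weather", "cheap flights", "lake michigan lodges", "text mining"]
mean, dist = length_stats(sample)
```

For this toy log the lengths are 1, 2, 3, and 2 terms, so the mean is 2.0 terms and half of the queries have two terms.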

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
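One common way to estimate α from observed word frequencies is a least-squares fit on a log-log scale. The sketch below assumes the rank-frequency form of Zipf's law, in which the frequency of the word of rank r is proportional to r^(-α); the fitting method is an assumption for illustration, not necessarily the one used in the studies cited, and `zipf_alpha` is a hypothetical helper.

```python
import math

def zipf_alpha(frequencies):
    """Estimate alpha as the negated slope of log(frequency) vs. log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic frequencies that follow the assumed law exactly with alpha = 1.2.
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 51)]
alpha = zipf_alpha(synthetic)
```

On real logs the fit is usually restricted to a range of ranks, since the head and the tail of the distribution tend to deviate from a straight line.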

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency (y-axis, log scale from 1e-07 to 1) vs. relative document frequency (x-axis, log scale from 1e-06 to 1) of each word in a vocabulary.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                   AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query           2.4                2.6             2.3                1.1
Queries per user          2.0                2.3             2.9                -
Answer pages per query    1.3                1.7             2.2                1.2
Boolean queries           <40%               10%             -                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later, in the section about Query Topics).

Query Properties

User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine.

[Figure: state diagram with the states Home Page, Advanced search, Directory, Results, Selected, Refining, and End session. Edges carry numeric labels and are drawn according to a legend with three ranges: x > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01. The individual edge values cannot be recovered from the extracted text.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extended and somewhat changed Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replaced Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified into more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                        1997   2001
1      Commerce, travel, employment, or economy     13.3   24.7
2      People, places, or things                    6.7    19.7
3      Non-English or unknown                       4.1    11.3
4      Computers or Internet                        12.5   9.6
5      Sex or pornography                           16.8   8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                  19.9   6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                   Top category              Second category
chat rooms              Computers/Internet        Online Community/Chat
lake michigan lodges    Info/Local & Regional     Living/Travel & Vacation
stephen hawking         Info/Science & Tech       Info/Arts & Humanities
dog shampoo             Shopping/Buying Guides    Living/Pets & Animals
text mining             Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
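The two-step idea (classify the retrieved results, then let them vote) can be sketched as follows. This is only an illustration under strong assumptions: the keyword-based `classify_document` is a hypothetical stand-in for a trained classifier, the categories and result snippets are invented, and plain majority voting is used rather than the voting scheme of Broder et al.

```python
from collections import Counter

def classify_document(text, keyword_map):
    """Hypothetical stand-in for a trained document classifier."""
    lowered = text.lower()
    for category, keywords in keyword_map.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Unknown"

def classify_query_by_voting(result_texts, keyword_map, top_k=1):
    """Classify each result, then return the top_k most voted categories."""
    votes = Counter(classify_document(t, keyword_map) for t in result_texts)
    return [category for category, _ in votes.most_common(top_k)]

# Invented categories and results for the query "dog shampoo".
keyword_map = {"Shopping": ["buy"], "Living/Pets": ["pet"]}
results = [
    "Dog shampoo reviews and buying guide",
    "Best pet grooming products to buy",
    "Caring for pets and animals",
]
predicted = classify_query_by_voting(results, keyword_map)
```

The query itself is too short to classify reliably, so the evidence comes from the longer texts of its results: two of the three results vote for "Shopping".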

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the fixed limit

the user might have more than one goal in the session

Hence, it is good to distinguish:

time-based sessions: queries of a user within the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

A query mission is a sequence of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
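Splitting a user's queries into sessions with a maximum inactivity time can be sketched as below. The 30-minute default is an assumption chosen from within the five-to-sixty-minute range mentioned above, and the event format (timestamp, query) and the `split_sessions` helper are hypothetical.

```python
from datetime import datetime, timedelta

def split_sessions(events, max_inactivity=timedelta(minutes=30)):
    """Group one user's (timestamp, query) events into sessions.

    A new session starts whenever the gap since the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for timestamp, query in sorted(events):
        if sessions and timestamp - sessions[-1][-1][0] <= max_inactivity:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions

# Hypothetical log of a single user.
events = [
    (datetime(2009, 5, 1, 10, 0), "lake michigan"),
    (datetime(2009, 5, 1, 10, 5), "lake michigan lodges"),
    (datetime(2009, 5, 1, 11, 0), "dog shampoo"),
]
sessions = split_sessions(events)
```

Note that this only finds time-based sessions; detecting missions would additionally require deciding whether consecutive queries express the same need.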

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to predict difficulty:

Post-retrieval mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
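The difference between the two language models is typically measured with the Kullback-Leibler divergence. The sketch below is a simplification under stated assumptions: both models are plain maximum likelihood unigram models, the top retrieved documents are weighted equally (the published Clarity Score weights them by their relation to the query), and the top documents are assumed to be part of the collection so that no collection probability is zero. The function names are hypothetical.

```python
import math
from collections import Counter

def language_model(docs):
    """Maximum likelihood unigram model over whitespace-separated terms."""
    counts = Counter(term for doc in docs for term in doc.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def clarity(top_docs, collection):
    """KL divergence (in bits) of the top-documents model from the collection model."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())

# Toy collection; the first document plays the role of the retrieved set.
collection = ["a b", "c d"]
focused = clarity(["a b"], collection)
unfocused = clarity(collection, collection)
```

A result set that looks like the whole collection scores 0, while a topically cohesive result set scores higher.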

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
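The overlap intuition can be illustrated with a simple set-based measure over the top-k documents of several ranked lists. This is only a proxy written for illustration, not the measure actually used by Yom-Tov et al. or Aslam et al.; the function names and the choice of k are assumptions.

```python
from itertools import combinations

def top_k_overlap(ranking_a, ranking_b, k=10):
    """Share of the larger top-k set that appears in both rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    denom = max(len(top_a), len(top_b))
    return len(top_a & top_b) / denom if denom else 0.0

def mean_pairwise_overlap(rankings, k=10):
    """Average top-k overlap over all pairs of ranked lists."""
    pairs = list(combinations(rankings, 2))
    if not pairs:
        return 1.0  # a single list trivially agrees with itself
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)

# Hypothetical rankings of document ids from three ranking functions.
run_1 = ["d1", "d2", "d3"]
run_2 = ["d2", "d3", "d4"]
run_3 = ["d1", "d2", "d3"]
pair_score = top_k_overlap(run_1, run_2, k=3)
agreement = mean_pairwise_overlap([run_1, run_2, run_3], k=3)
```

Under this reading, a high agreement value suggests an easy query and a low value suggests a difficult one.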

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query performance prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
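As a small illustration of a frequency-based predictor, Averaged IDF can be sketched as the mean inverse document frequency of the query terms. The exact IDF variant is an assumption here (log base 2 of N over the document frequency, with unseen terms skipped), and `averaged_idf` and the toy statistics are hypothetical.

```python
import math

def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean of log2(N / df) over the query terms that occur in the collection."""
    idfs = [math.log2(num_docs / doc_freq[term])
            for term in query_terms if doc_freq.get(term, 0) > 0]
    return sum(idfs) / len(idfs) if idfs else 0.0

# Toy collection statistics: 8 documents, document frequency per term.
doc_freq = {"the": 8, "zipf": 1}
common_and_rare = averaged_idf(["the", "zipf"], doc_freq, num_docs=8)
rare_only = averaged_idf(["zipf"], doc_freq, num_docs=8)
```

A query made only of rare terms gets the higher value, which matches the prediction that queries with low-frequency terms perform better.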

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
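The SCS formula can be computed directly from term counts. In the sketch below, the maximum likelihood estimator is taken to be the count of the term in the query divided by the query length, and query terms that never occur in the collection are skipped; both choices, the function name, and the toy counts are assumptions.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, coll_counts, coll_total):
    """SCS(q) = sum over query terms of P_ml * log2(P_ml / P_coll)."""
    query_counts = Counter(query_terms)
    query_length = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / query_length                    # P_ml(k_i | q)
        p_coll = coll_counts.get(term, 0) / coll_total  # P_coll(k_i)
        if p_coll == 0:
            continue  # assumption: ignore terms absent from the collection
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Toy collection with 16 term occurrences in total.
coll_counts = {"text": 4, "mining": 1}
scs = simplified_clarity_score(["text", "mining"], coll_counts, coll_total=16)
```

Here each query term has $P_{ml} = 0.5$, and the collection probabilities are 0.25 and 0.0625, so the score is 0.5 × 1 + 0.5 × 3 = 2.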

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
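A minimal sketch of Averaged PMI over a small in-memory collection. It makes two assumptions that the formula leaves open: all three probabilities are estimated at the document level (fraction of documents containing the term or the pair), and pairs that never co-occur are skipped because their logarithm is undefined. The `averaged_pmi` helper and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average log2( P(a, b) / (P(a) * P(b)) ) over pairs of distinct query terms."""
    doc_sets = [set(doc.lower().split()) for doc in docs]
    num_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in s for t in terms) for s in doc_sets) / num_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint == 0:
            continue  # assumption: skip pairs that never co-occur
        values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0

# Toy collection of four documents.
docs = ["a b", "a b", "c", "d"]
related = averaged_pmi(["a", "b"], docs)
unrelated = averaged_pmi(["a", "c"], docs)
```

Terms "a" and "b" always appear together, so their PMI is positive: log2(0.5 / (0.5 × 0.5)) = 1.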

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.

Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24

Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed StructureThe structure allowed in texts was traditionally quiterestrictive

The documents had a fixed set of fields and each fieldhad some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

They were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26

Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power

Retrieval from hypertext began as a merely navigationalactivity

That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted

Some query tools allow querying hypertext based ontheir content and their structure

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28

Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure

An example of a hierarchical structure the page of abook and its schematic view

Chapter 6

We cover in this chapter the different kind of

61 Keyword Based

63 Structured Queries

chapter

section section

title title figure

We cover Keyword Based Structural

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29

Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
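The voting step can be illustrated with a minimal sketch. The slide does not specify the voting scheme, so the rank-discounted vote below (each result contributes 1/rank to its category) is an assumption, and the function name `classify_query_by_voting` is hypothetical.

```python
from collections import Counter


def classify_query_by_voting(result_categories, top_n=1):
    """Pick query categories by voting over the categories of its results.

    result_categories: category label of each search result, in rank order.
    The 1/rank weighting is an assumed scheme, not the one from the paper.
    """
    votes = Counter()
    for rank, category in enumerate(result_categories, start=1):
        votes[category] += 1.0 / rank  # higher-ranked results count more
    return [category for category, _ in votes.most_common(top_n)]


# Categories assigned by some document classifier to the top five results.
labels = ["Shopping", "Pets", "Shopping", "Pets", "Pets"]
print(classify_query_by_voting(labels))
```

With this weighting a category that appears less often can still win if it appears at the top ranks; a plain majority vote would be the other obvious choice.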

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the limit

the user might have more than one goal in the session

Hence it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
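The maximum inactivity rule can be sketched as follows for the log of a single user. The function name `split_sessions`, the `(timestamp, query)` entry layout, and the 30-minute default are assumptions chosen for illustration; the slide only states that reported thresholds range from five to sixty minutes.

```python
from datetime import datetime, timedelta


def split_sessions(entries, max_inactivity=timedelta(minutes=30)):
    """Group (timestamp, query) pairs of one user into sessions.

    A new session starts whenever the gap to the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for ts, query in sorted(entries):
        if sessions and ts - sessions[-1][-1][0] <= max_inactivity:
            sessions[-1].append((ts, query))
        else:
            sessions.append([(ts, query)])
    return sessions


log = [
    (datetime(2007, 5, 1, 10, 0), "weather"),
    (datetime(2007, 5, 1, 10, 10), "weather forecast madrid"),
    (datetime(2007, 5, 1, 11, 0), "dog shampoo"),
]
for session in split_sessions(log):
    print([query for _, query in session])
```

This only recovers time-based sessions. Detecting missions would additionally require comparing the queries themselves, since one session can mix several goals.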

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
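A simplified sketch of the Clarity Score idea: the difference between the two language models is measured here as the Kullback-Leibler divergence (in bits) between a maximum likelihood unigram model of the top retrieved documents and one of the whole collection. The unsmoothed estimates and the names `unigram_model` and `clarity` are assumptions made for illustration; published variants estimate the query model more carefully.

```python
import math
from collections import Counter


def unigram_model(docs):
    """Maximum likelihood unigram model over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity(top_docs, collection):
    """KL divergence between the top-document model and the collection model.

    Assumes top_docs is a subset of collection, so every term seen in the
    top documents has non-zero collection probability.
    """
    p_top = unigram_model(top_docs)
    p_coll = unigram_model(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())


collection = [["text", "mining"], ["dog", "shampoo"]]
print(clarity([["text", "mining"]], collection))  # cohesive top results
print(clarity(collection, collection))            # top results look like the collection
```

A score near zero means the top results are distributed like the collection as a whole, which is the behavior the slide associates with ambiguous queries.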

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
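One simple way to quantify the overlap between the top-ranked documents of several ranked lists is the mean pairwise Jaccard overlap of their top-k sets. This measure and the name `mean_topk_overlap` are assumptions for illustration and are not necessarily the measure used by the cited authors.

```python
from itertools import combinations


def mean_topk_overlap(ranked_lists, k=10):
    """Mean Jaccard overlap of the top-k documents over all pairs of lists.

    ranked_lists: at least two non-empty lists of document ids, e.g. one per
    ranking function (or one per query sub-set).
    """
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


bm25_like = ["d1", "d2", "d3"]
lm_like = ["d1", "d2", "d4"]
print(mean_topk_overlap([bm25_like, lm_like], k=3))
```

Values close to 1 would suggest an easy query under this intuition, values close to 0 a difficult one.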

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
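The Query Feedback loop can be sketched generically. Both callables are hypothetical stand-ins: `retrieve` plays the role of the retrieval system and `make_query` the role of whatever procedure generates the new query from the first ranked list. Measuring overlap as the fraction of shared top-k documents is likewise an assumption.

```python
def query_feedback_score(query, retrieve, make_query, k=10):
    """Overlap between the ranking of a query and of its regenerated query.

    retrieve(query)   -> ranked list of document ids (hypothetical system).
    make_query(docs)  -> new query built from a ranked list (hypothetical).
    """
    first = retrieve(query)[:k]          # ranked list L for the query Q
    if not first:
        return 0.0
    new_query = make_query(first)        # regenerated query
    second = retrieve(new_query)[:k]     # second ranking
    return len(set(first) & set(second)) / len(first)


# Toy stand-ins: a fixed lookup table instead of a real search engine.
toy_index = {
    "q": ["d1", "d2", "d3", "d4"],
    "q2": ["d1", "d2", "d9", "d8"],
}
score = query_feedback_score("q", toy_index.__getitem__, lambda docs: "q2", k=4)
print(score)
```

A high score means the regenerated query leads back to mostly the same documents, which is read as a sign that the original ranking was not dominated by noise.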

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
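As an illustration of a frequency-based predictor, Averaged IDF can be sketched as the mean inverse document frequency of the query terms. The specific IDF form log2(N / df) and the name `averaged_idf` are assumptions; other IDF variants would work the same way.

```python
import math


def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean IDF of the query terms, with idf(t) = log2(num_docs / df(t)).

    doc_freq maps each term to the number of documents containing it;
    every query term is assumed to occur in at least one document.
    """
    idfs = [math.log2(num_docs / doc_freq[term]) for term in query_terms]
    return sum(idfs) / len(idfs)


doc_freq = {"text": 4, "mining": 1}
print(averaged_idf(["text", "mining"], doc_freq, num_docs=8))
```

Higher values correspond to queries made of rarer terms, which this family of predictors expects to perform better.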

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
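The SCS formula can be computed directly from term counts. In this sketch the maximum likelihood estimate of a term given the query is taken as its count in the query divided by the query length, which is the usual reading but is stated here as an assumption; the function name `simplified_clarity_score` is hypothetical.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, coll_counts, coll_size):
    """SCS(q) = sum over query terms of P_ml * log2(P_ml / P_coll).

    coll_counts: term -> number of occurrences in the collection.
    coll_size:   total number of term occurrences in the collection.
    """
    query_counts = Counter(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)            # P_ml(k_i | q)
        p_coll = coll_counts[term] / coll_size     # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


coll_counts = {"text": 8, "mining": 2}
print(simplified_clarity_score(["text", "mining"], coll_counts, coll_size=64))
```

Because no retrieval is run, the whole computation needs only the collection term statistics, which is what makes it a pre-retrieval predictor.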

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
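A small sketch of Averaged PMI over a toy collection. To keep the ratio consistent, all three probabilities are estimated here at the document level (fraction of documents containing the term or the pair); that choice, and the name `averaged_pmi`, are assumptions for illustration.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over all pairs of query terms.

    docs: list of tokenized documents. Each pair of query terms is assumed
    to co-occur in at least one document, otherwise log2(0) is undefined.
    """
    doc_sets = [set(doc) for doc in docs]

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / len(doc_sets)

    pairs = list(combinations(sorted(set(query_terms)), 2))
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)


docs = [["text", "mining"], ["text", "mining"], ["dog"], ["shampoo"]]
print(averaged_pmi(["text", "mining"], docs))
```

Terms that co-occur more often than independence would predict give a positive value, which suggests a topically coherent query.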

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

However, neither technique is efficient enough to be useful for large collections.


Query Languages: Structural Queries

Structural Queries

Text collections tend to have some structure built into them.

The standardization of languages to represent structured texts has pushed forward in this direction.

Mixing content and structure in queries allows posing very powerful queries.

Queries can be expressed using containment, proximity, or other restrictions on the structural elements.

More details in Chapter 13 on Structured Text Retrieval.

Structural Queries

The three main types of structures:

a) form-like fixed structure
b) hypertext structure
c) hierarchical structure

[Figure: schematic views of the three structure types a), b), and c)]

Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

Some fields were not present in all documents.

Some documents could have text not classified under any field.

Fields were not allowed to nest or overlap.

Retrieval activity allowed specifying that a given basic pattern was to be found only in a given field.
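Restricting a pattern to one field can be sketched with documents stored as dictionaries from field names to text. The data layout, the case-insensitive substring match, and the name `search_field` are assumptions for illustration only.

```python
def search_field(documents, field, pattern):
    """Return ids of documents whose given field contains the pattern.

    documents: doc id -> {field name: text}. Documents that lack the field
    simply do not match, mirroring fields that are absent in some documents.
    """
    needle = pattern.lower()
    return [
        doc_id
        for doc_id, fields in documents.items()
        if needle in fields.get(field, "").lower()
    ]


documents = {
    1: {"title": "Text retrieval", "body": "An overview of indexing."},
    2: {"title": "Databases", "body": "Text retrieval inside SQL systems."},
    3: {"body": "A note without a title field."},
}
print(search_field(documents, "title", "retrieval"))
print(search_field(documents, "body", "retrieval"))
```

The same word matches different documents depending on the field, which is the extra selectivity that fixed structure adds to plain keyword search.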

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them we can mention proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links to search for what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: left, a book page headed "Chapter 6" with the text "We cover in this chapter the different kind of ..." and the section headings "6.1 Keyword Based" and "6.3 Structured Queries"; right, its schematic view as a tree with a chapter node, two section nodes, and title and figure nodes beneath them]

Fixed Structure

An example of a query to the hierarchical structure presented:

[Figure: parse tree of the query, built from the operators IN and WITH over the elements figure, section, and title and the word "structural"]

This parsed query returns the image below.

[Figure: the image retrieved by the query]
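A query over a hierarchical structure can be sketched on a small tree. This assumes the example query asks for figures inside sections whose title contains the word "structural"; the `Node` class and the function names are hypothetical and do not correspond to any particular structured query language.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "chapter", "section", "title", "figure"
    text: str = ""
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node below the given node, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def figures_in_sections_titled(root, word):
    """Figures contained in sections that have a title containing word."""
    hits = []
    for node in [root, *descendants(root)]:
        if node.kind != "section":
            continue
        titled = any(
            c.kind == "title" and word.lower() in c.text.lower()
            for c in node.children
        )
        if titled:
            hits.extend(d for d in descendants(node) if d.kind == "figure")
    return hits


chapter = Node("chapter", children=[
    Node("section", children=[Node("title", "Keyword Based")]),
    Node("section", children=[
        Node("title", "Structural Queries"),
        Node("figure", "schematic view"),
    ]),
])
print([f.text for f in figures_in_sections_titled(chapter, "structural")])
```

The query combines a content condition (the word in the title) with containment conditions (title in section, figure in section). With nested matching sections this sketch could report the same figure more than once.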

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Servers (WAIS)

Query Protocols

In the CD-ROM publishing arena there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

    Select abstract from journal.papers where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

    ... where paper contains "retrieval" or like "info%" and date > 1/1/98

Query Properties

Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo! UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths for these last two references.

Length (terms)   Dogpile 2005 (%)   Yahoo 2007 (%)
1                18                 25
2                32                 35
3                25                 23
4                13                 11
5                 6                  4
6                 3                  1
>7                3                  1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
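Under Zipf's law the frequency of the r-th most frequent word is proportional to r^(-α), so α can be read off as the negated slope of a straight line fitted to log frequency against log rank. The least-squares fit below is only a minimal sketch of that idea, and the name `zipf_exponent` is hypothetical; careful studies use more robust estimators.

```python
import math


def zipf_exponent(frequencies):
    """Estimate the Zipf parameter from word frequencies by a log-log fit.

    frequencies: positive occurrence counts, one per distinct word
    (at least two words are needed for the fit).
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -slope


# Synthetic frequencies that follow the law exactly with parameter 1.
synthetic = [1000 / rank for rank in range(1, 51)]
print(round(zipf_exponent(synthetic), 3))
```

Applied to the word counts of a query log and to those of a Web crawl, the two estimates could then be compared in the way the slide describes.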

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

Relative query frequency vs. relative document frequency of each word in a vocabulary:

[Figure: log-log scatter plot; x-axis: document frequency (1e-06 to 1); y-axis: query frequency (1e-07 to 1)]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista   Excite   AlltheWeb   TodoCL
                         (1998)      (2001)   (2001)      (2002)
Words per query          2.4         2.6      2.3         1.1
Queries per user         2.0         2.3      2.9         –
Answer pages per query   1.3         1.7      2.2         1.2
Boolean queries          <40%        10%      –           16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine:

[Figure: state diagram with nodes labeled Home Page, Advanced search, Directory, Web, URL, Results, Selected, Refining, and End session; the arrows carry transition probabilities and are drawn in three bands: > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: The immediate intent is to reach a particular site (24.5% survey, 20% query log).

Informational: The intent is to acquire some information (39% survey, 48% query log).

Transactional: The intent is to perform some web-mediated activity (36% survey, 30% query log).

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replaced Broder's transactional category with a broader category of resources.



Structural QueriesThe text collections tend to have some structure builtinto them

The standardization of languages to represent structured textshas pushed forward in this direction

Mixing contents and structure in queries allows posingvery powerful queries

Queries can be expressed using containment proximityor other restrictions on the structural elements

More details in Chapter 13 on Structured Text Retrieval

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 24

Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed StructureThe structure allowed in texts was traditionally quiterestrictive

The documents had a fixed set of fields and each fieldhad some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

They were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26

Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links, to find what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page headed "Chapter 6", with sections "6.1 Keyword Based" and "6.3 Structured Queries", shown next to its schematic view as a tree with a chapter node, two section nodes, two title nodes, and a figure node.]

Fixed Structure

An example of a query to the hierarchical structure presented:

[Figure: parse tree of the query, with operator nodes IN, WITH, and WITH, and operands figure, section, title, and "structural".]

This parsed query returns the image below.

[Figure: the image returned by the query.]

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

Z39.50

Wide Area Information Server (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

... where paper contains "retrieval" or like "info %" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

query occurrences

term occurrences

query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo! UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for the last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18%              25%
2        32%              35%
3        25%              23%
4        13%              11%
5        6%               4%
6        3%               1%
>7       3%               1%
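
A distribution of this kind can be computed directly from a query log. The following is a minimal sketch, assuming the log is available as a list of query strings and that terms are separated by whitespace; the function names and the sample queries are illustrative only, and the last bucket here collects all lengths of `max_len` or more, which is an assumption about how long queries are grouped.

```python
from collections import Counter

def length_distribution(queries, max_len=7):
    """Percentage of queries per length (in terms).

    Lengths of max_len or more share the last bucket; empty queries are ignored.
    """
    lengths = [len(q.split()) for q in queries if q.strip()]
    counts = Counter(min(n, max_len) for n in lengths)
    return {n: 100.0 * counts[n] / len(lengths) for n in range(1, max_len + 1)}

def mean_length(queries):
    """Mean number of terms per non-empty query."""
    lengths = [len(q.split()) for q in queries if q.strip()]
    return sum(lengths) / len(lengths)

# Hypothetical four-query log.
sample_log = ["weather", "cheap flights", "modern information retrieval", "dogpile"]
print(length_distribution(sample_log))
print(mean_length(sample_log))
```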


Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
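
One common way to estimate the Zipf parameter α from a log is a least-squares fit of log-frequency against log-rank. The sketch below takes that approach; it is only one possible estimator, not necessarily the one used in the studies cited, and the function names and synthetic data are assumptions made for illustration.

```python
import math
from collections import Counter

def term_frequencies(queries):
    """Count how often each term occurs across all queries."""
    return Counter(term for q in queries for term in q.lower().split())

def zipf_alpha(frequencies):
    """Estimate alpha in f(r) ~ C / r**alpha by least squares on log-log data."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope  # the fitted line has slope -alpha

# Synthetic frequencies that follow an exact power law with alpha = 1.2.
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 201)]
print(round(zipf_alpha(synthetic), 3))
```

For a real log, `zipf_alpha(term_frequencies(queries).values())` would give the estimate; the fit is sensitive to the low-frequency tail, so the value should be read as indicative.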


Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary. Horizontal axis: document frequency; vertical axis: query frequency; both on logarithmic scales.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query, adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                -
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             -                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later, in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels: Home Page, Advanced search, Directory, Selected, Results, Refining, End session, URL, Web. Edges carry transition probabilities and are drawn in three classes: > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implied that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log).

Informational: the intent is to acquire some information (39% survey, 48% query log).

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log).

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extended and somewhat changed Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replaced Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                        1997   2001
1      Commerce, travel, employment, or economy     13.3   24.7
2      People, places, or things                    6.7    19.7
3      Non-English or unknown                       4.1    11.3
4      Computers or Internet                        12.5   9.6
5      Sex or pornography                           16.8   8.5
6      Health or sciences                           9.5    7.5
7      Entertainment or recreation                  19.9   6.6
8      Education or humanities                      5.6    4.5
9      Society, culture, ethnicity, or religion     5.4    3.9
10     Government                                   3.4    2.0
11     Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
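
The idea of classifying a query through its results can be sketched with a plain majority vote. The source does not specify the voting algorithm, so the unweighted vote below is an assumption, and `toy_classifier` is a hypothetical stand-in for a document classifier trained on the taxonomy.

```python
from collections import Counter

def classify_query_by_voting(results, classify_document, top_k=1):
    """Classify each result document, then return the top_k most voted categories."""
    votes = Counter(classify_document(doc) for doc in results)
    return [category for category, _ in votes.most_common(top_k)]

def toy_classifier(text):
    """Hypothetical stand-in for a trained document classifier."""
    rules = {"shampoo": "Shopping", "price": "Shopping", "breed": "Pets"}
    for keyword, category in rules.items():
        if keyword in text.lower():
            return category
    return "Unknown"

# Hypothetical result snippets for the query "dog shampoo".
results = [
    "Best dog shampoo price list",
    "Dog breed guide",
    "Shampoo for puppies on sale",
]
print(classify_query_by_voting(results, toy_classifier))
```

The appeal of this scheme for short, rare queries is that the result documents supply far more text to classify than the query itself.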


Query Topic

Ambiguous queries are those that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer

the user might have more than one goal in the session

Hence it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
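
Splitting a user's log by a maximum inactivity time can be sketched as follows. The 30-minute default is only an assumption picked from inside the five-to-sixty-minute range mentioned above, and the `(timestamp, query)` log format is hypothetical.

```python
from datetime import datetime, timedelta

def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Group one user's (timestamp, query) pairs into sessions.

    A new session starts whenever the gap since the previous query
    exceeds max_gap.
    """
    sessions = []
    for timestamp, query in sorted(events):
        if sessions and timestamp - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions

# Hypothetical log for a single user.
base = datetime(2007, 1, 1, 10, 0)
log = [
    (base, "jaguar"),
    (base + timedelta(minutes=10), "jaguar price"),
    (base + timedelta(minutes=90), "weather"),
]
for session in split_sessions(log):
    print([query for _, query in session])
```

This only recovers time-based sessions; grouping queries into missions would additionally require comparing the queries themselves.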


Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
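
The intuition behind Clarity Score can be illustrated by measuring the Kullback-Leibler divergence between a unigram model of the top retrieved documents and a unigram model of the collection. This is a simplified sketch: it uses unsmoothed maximum-likelihood models and weights all top documents equally, both of which are assumptions, and it requires the top documents to be part of the collection so that every term has a nonzero collection probability.

```python
import math
from collections import Counter

def unigram_model(docs):
    """Maximum-likelihood unigram distribution over whitespace-separated terms."""
    counts = Counter(term for doc in docs for term in doc.split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def clarity(top_docs, collection_docs):
    """KL divergence (in bits) between the top-document and collection models."""
    p_top = unigram_model(top_docs)
    p_coll = unigram_model(collection_docs)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())

# Hypothetical four-document collection.
collection = [
    "jaguar car speed",
    "jaguar cat jungle",
    "stock market news",
    "weather forecast today",
]
print(clarity(["jaguar car speed"], collection))  # cohesive answer set
print(clarity(collection, collection))            # answer set mirrors the collection
```

A topically cohesive answer set diverges from the collection and scores high, while an answer set that looks like the collection as a whole scores close to zero.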


Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query performance prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.
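
The Query Feedback loop can be sketched as below. The `search` and `generate_query` callables are hypothetical placeholders for the retrieval system and for whatever method builds Q′ from L, and measuring overlap as the fraction of shared documents in the top k is an assumption about how the overlap is computed.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Fraction of the top-k positions filled by documents common to both rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

def query_feedback_score(query, search, generate_query, k=10):
    """Run Q, derive Q' from its ranked list L, run Q', and compare L with L'."""
    first = search(query)                     # ranked list L
    new_query = generate_query(first[:k])     # query Q' generated from L
    second = search(new_query)                # ranked list L'
    return overlap_at_k(first, second, k)

# Hypothetical retrieval system backed by a fixed lookup table.
toy_index = {"a": ["d1", "d2", "d3"], "b": ["d2", "d3", "d4"]}
score = query_feedback_score("a", toy_index.get, lambda docs: "b", k=3)
print(score)
```

A high overlap suggests that the channel preserved the information in the query, which is read as a sign of an easy query.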

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
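
As an example of a frequency-based pre-retrieval predictor, Averaged IDF can be sketched in a few lines. The slides do not give the formula, so the common form log2(N / df) is assumed here, and the document-frequency table is hypothetical.

```python
import math

def averaged_idf(query_terms, document_frequency, n_docs):
    """Mean of log2(N / df) over the query terms (assumed IDF form)."""
    idfs = [math.log2(n_docs / document_frequency[term]) for term in query_terms]
    return sum(idfs) / len(idfs)

# Hypothetical statistics for a 1000-document collection.
df = {"text": 250, "mining": 125, "the": 1000}
print(averaged_idf(["text", "mining"], df, 1000))
print(averaged_idf(["the", "text"], df, 1000))
```

Rarer terms yield a higher value, which matches the prediction that queries with low-frequency terms perform better.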


Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
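
The SCS definition above translates almost directly into code. In this sketch the maximum likelihood estimate is taken to be the term's count in the query divided by the query length, which is an assumption about the estimator; the collection counts are hypothetical.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, collection_counts, collection_size):
    """SCS(q): sum over distinct query terms of P_ml * log2(P_ml / P_coll)."""
    score = 0.0
    for term, count in Counter(query_terms).items():
        p_ml = count / len(query_terms)                      # P_ml(k_i | q)
        p_coll = collection_counts[term] / collection_size   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Hypothetical term counts in a collection of 1000 term occurrences.
counts = {"text": 50, "mining": 25}
print(simplified_clarity_score(["text", "mining"], counts, 1000))
```

Every query term must occur in the collection, since a zero collection probability would make the ratio undefined.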


Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
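
A small sketch of Averaged PMI follows. For simplicity it estimates all three probabilities from document frequencies, and it skips pairs of terms that never co-occur (whose logarithm would be undefined); both choices are assumptions made for this illustration, as are the sample documents.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average log2(P(a, b) / (P(a) P(b))) over pairs of distinct query terms."""
    doc_sets = [set(doc.split()) for doc in docs]

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in s for t in terms) for s in doc_sets) / len(doc_sets)

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:  # pairs that never co-occur are skipped (assumption)
            values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0

# Hypothetical four-document collection.
docs = ["text mining tools", "text mining", "web search", "web mining"]
print(averaged_pmi(["text", "mining"], docs))
```

A positive value means the query terms co-occur more often than independence would predict, suggesting a coherent query.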


Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much sparser.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

However, neither technique is efficient enough to be useful for large collections.


Structural QueriesThe three main types of structures

a) form-like fixed structureb) hypertext structurec) hierarchical structure

a) b) c)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 25

Fixed StructureThe structure allowed in texts was traditionally quiterestrictive

The documents had a fixed set of fields and each fieldhad some text inside

Some fields were not present in all documents

Some documents could have text not classified under any field

They were not allowed to nest or overlap

Retrieval activity allowed specifying that a given basicpattern was to be found only in a given field

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 26

Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power

Retrieval from hypertext began as a merely navigationalactivity

That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted

Some query tools allow querying hypertext based ontheir content and their structure

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28

Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure

An example of a hierarchical structure the page of abook and its schematic view

Chapter 6

We cover in this chapter the different kind of

61 Keyword Based

63 Structured Queries

chapter

section section

title title figure

We cover Keyword Based Structural

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29

Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit

  • the user might have more than one goal in the session

Hence, it is good to distinguish:

  • time-based sessions: queries of a user in the same session

  • missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
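Splitting a query log by maximum inactivity time can be sketched as follows. The function name is hypothetical, timestamps are assumed to be seconds for a single user, and the 30-minute default is an arbitrary value inside the five-to-sixty-minute range mentioned above, not a recommended threshold.

```python
def split_sessions(timestamps, max_inactivity_seconds=30 * 60):
    """Group the query timestamps (in seconds) of one user into sessions.

    A new session starts whenever the gap to the previous query exceeds
    max_inactivity_seconds.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_inactivity_seconds:
            sessions[-1].append(t)  # still within the inactivity window
        else:
            sessions.append([t])    # gap too long: open a new session
    return sessions


print(split_sessions([0, 60, 120, 4000, 4100]))
```

This only yields time-based sessions; detecting missions would additionally require comparing the content of the queries.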

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set.

  • Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
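The idea behind Clarity Score can be sketched as a Kullback-Leibler divergence between two term distributions. This is a simplified illustration, not the published method: it assumes unsmoothed maximum-likelihood distributions, treats all top documents equally, and requires the top documents to be drawn from the collection. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def term_distribution(documents):
    """Maximum-likelihood term distribution over a list of tokenized documents."""
    counts = Counter(term for doc in documents for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity_score(top_documents, collection):
    """KL divergence (in bits) of the top-document model from the collection model."""
    p_top = term_distribution(top_documents)
    p_coll = term_distribution(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())


collection = [["jaguar", "car"], ["jaguar", "cat"],
              ["stock", "market"], ["stock", "price"]]
print(clarity_score(collection[:1], collection))  # focused answer set
print(clarity_score(collection, collection))      # answer set equals the collection
```

A score near zero means that the answer set looks like the collection as a whole, which is the behavior the slide associates with ambiguous queries.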

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and

  • a posterior state in which the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
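The final step of Query Feedback, measuring the overlap between two rankings, can be sketched as below. The sketch covers only the overlap computation; how Q′ is generated from L is not shown. The function name, the cutoff k, and the document identifiers are hypothetical.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Fraction of the top-k documents that two ranked lists have in common."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Hypothetical rankings L (original query) and L' (regenerated query).
ranking_l = ["d1", "d2", "d3", "d4"]
ranking_l_prime = ["d2", "d1", "d7", "d9"]
print(overlap_at_k(ranking_l, ranking_l_prime, k=4))
```

The same kind of overlap measure could be applied to rankings produced by different ranking functions, in the spirit of the approach attributed to Aslam et al.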

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

  • the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
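As an illustration of a frequency-based predictor, the sketch below computes an averaged IDF over the query terms. It assumes the common definition idf(k) = log2(N / n_k), where N is the number of documents and n_k the number of documents containing k; other IDF variants exist, and the function name and the toy statistics are hypothetical.

```python
import math


def averaged_idf(query_terms, document_frequency, num_documents):
    """Mean inverse document frequency of the query terms.

    document_frequency maps each term to the number of documents containing it.
    """
    idfs = [math.log2(num_documents / document_frequency[term])
            for term in query_terms]
    return sum(idfs) / len(idfs)


# Hypothetical collection statistics.
df = {"the": 1024, "retrieval": 4}
print(averaged_idf(["the", "retrieval"], df, num_documents=1024))
```

Higher values indicate rarer query terms, which matches the prediction that queries with low-frequency terms perform better.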

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$ SCS(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right) $$

where

  • P_{ml}(k_i | q) is the maximum likelihood estimator of term k_i given query q

  • P_{coll}(k_i) is the term count of k_i in the collection divided by the total number of terms in the collection
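The SCS formula translates almost directly into code. The sketch below assumes that the maximum likelihood estimate P_ml(k_i|q) is the count of k_i in the query divided by the query length, and that every query term occurs in the collection; the function name and the toy counts are hypothetical.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_term_counts, collection_size):
    """SCS(q): sum over query terms of P_ml * log2(P_ml / P_coll).

    collection_term_counts maps a term to its number of occurrences in the
    collection; collection_size is the total number of term occurrences.
    """
    query_counts = Counter(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)                       # P_ml(k_i | q)
        p_coll = collection_term_counts[term] / collection_size  # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection with 1000 term occurrences in total.
counts = {"text": 250, "mining": 125}
print(simplified_clarity_score(["text", "mining"], counts, 1000))
```

Because only query and collection statistics are needed, the score can be computed before any document is retrieved.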

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$ AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right) $$

where P_{coll}(k_i, k_\ell) is the probability that terms k_i and k_\ell occur in the same document.
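A minimal sketch of the AvPMI computation follows. It makes several assumptions that the slide does not state: the marginal P_coll(k) is estimated as the fraction of documents containing k, every pair of query terms co-occurs in at least one document (otherwise the logarithm is undefined), and the query has at least two distinct terms. The function name and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, documents):
    """Average pointwise mutual information over all pairs of query terms."""
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)

    def probability(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in doc for t in terms) for doc in doc_sets) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    if not pairs:
        raise ValueError("AvPMI needs at least two distinct query terms")
    values = [math.log2(probability(a, b) / (probability(a) * probability(b)))
              for a, b in pairs]
    return sum(values) / len(values)


docs = [["text", "mining"], ["text", "mining"], ["web"], ["search"]]
print(averaged_pmi(["text", "mining"], docs))
```

A practical implementation would need a policy for pairs that never co-occur, for example smoothing or skipping them.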

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering.

  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.


Fixed Structure

The structure allowed in texts was traditionally quite restrictive.

The documents had a fixed set of fields, and each field had some text inside.

  • Some fields were not present in all documents.

  • Some documents could have text not classified under any field.

  • Fields were not allowed to nest or overlap.

Retrieval allowed specifying that a given basic pattern was to be found only in a given field.

Fixed Structure

When the structure is very rigid, the content of some fields can be interpreted as numbers, dates, etc.

This idea leads naturally to the relational model, with each field corresponding to a column in the database table.

There are several proposals that extend SQL to allow full-text retrieval.

Among them are proposals by the leading relational database vendors, such as Oracle and Sybase, as well as SFQL.

Fixed Structure

Hypertexts probably represent the opposite trend with respect to structuring power.

Retrieval from hypertext began as a merely navigational activity.

That is, the user had to manually traverse the hypertext nodes, following links, to find what he/she wanted.

Some query tools allow querying hypertexts based on their content and their structure.

Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure.

An example of a hierarchical structure: the page of a book and its schematic view.

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) next to its schematic view, a tree with a chapter node over two section nodes, which contain title and figure nodes.]

Fixed Structure

An example of a query on the hierarchical structure presented, shown as a parse tree:

  IN
    figure
    WITH
      section
      WITH
        title
        "structural"

This parsed query returns the figure of the example page (image not reproduced).

Query Languages: Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are/were:

  • Z39.50

  • Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

  • Common Command Language (CCL)

  • Compact Disk Read only Data exchange (CD-RDx)

  • Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

  Select abstract from journal.papers where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wild-cards and repetition.

For example:

  ... where paper contains "retrieval" or like "info%" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

  • query occurrences

  • term occurrences

  • query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo! UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (in % of queries) for these last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1

Characterizing Web Queries

Queries, as words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
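One simple, admittedly crude, way to estimate the Zipf parameter α from a query log is a least-squares fit of log frequency against log rank; the negated slope is taken as α. The sketch below assumes the word frequencies have already been counted, and the function name and the synthetic data are hypothetical. It is not the estimation procedure used in the studies cited above.

```python
import math


def estimate_zipf_alpha(frequencies):
    """Estimate alpha in f(r) ~ C / r**alpha by a log-log least-squares fit."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance  # Zipf exponent is the negated slope


# Synthetic frequencies that follow an exact Zipf law with alpha = 1.
synthetic = [1000 / rank for rank in range(1, 51)]
print(estimate_zipf_alpha(synthetic))
```

On real logs the tail of rare words tends to distort this fit, so the result should be read as a rough indication only.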

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary; log-log plot with document frequency on the x-axis and query frequency on the y-axis.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of queries has shifted from leisure to e-commerce (more details later, in the section on Query Topic).

Query Properties: User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine.

[Figure: state diagram whose recoverable labels are Home Page, Advanced search, Directory, Results, Selected, Refining, URL, Web, and End session; edges carry transition probabilities, and the legend distinguishes probabilities x > 0.2, 0.2 ≥ x > 0.05, and 0.05 ≥ x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log).

  • Informational: the intent is to acquire some information (39% survey, 48% query log).

  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log).

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • Anchor-text distribution of the words in the queries

  • Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or

  • the rich and complex information need of studying meteorology.

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.



Fixed StructureWhen the structure is very rigid the content of somefields can be interpreted as numbers dates etc

This idea leads naturally to the relational model eachfield corresponding to a column in the database table

There are several proposals that extend SQL to allowfull-text retrieval

Among them we can mention proposals by the leading relationaldatabase vendors such as Oracle and Sybase as well as SFQL

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 27

Fixed StructureHypertexts probably represent the opposite trend withrespect to structuring power

Retrieval from hypertext began as a merely navigationalactivity

That is the user had to manually traverse the hypertextnodes following links to search what heshe wanted

Some query tools allow querying hypertext based ontheir content and their structure

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 28

Fixed StructureAn intermediate model which lies between fixedstructure and hypertext is the hierarchical structure

An example of a hierarchical structure the page of abook and its schematic view

Chapter 6

We cover in this chapter the different kind of

61 Keyword Based

63 Structured Queries

chapter

section section

title title figure

We cover Keyword Based Structural

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 29

Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5        6                4
6        3                1
>7       3                1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
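To make the role of α concrete, the sketch below estimates it from a list of term frequencies by fitting a straight line to log-frequency against log-rank. This least-squares fit is only one simple estimator and is an assumption here; the slides do not say how the cited studies estimated α, and the function name `zipf_exponent` is hypothetical.

```python
import math


def zipf_exponent(frequencies):
    """Estimate alpha in f(r) ~ C / r**alpha by least squares on log-log data."""
    # Rank terms by decreasing frequency; rank 1 is the most frequent term.
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # the fitted slope is -alpha


# Synthetic frequencies generated with alpha = 1.2 (illustrative only).
synthetic = [1000.0 / r ** 1.2 for r in range(1, 201)]
alpha = zipf_exponent(synthetic)
```

On real query logs the fit is sensitive to the low-frequency tail, so the estimate depends on how many ranks are included.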

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary. Logarithmic axes; x-axis: document frequency (1e-06 to 1); y-axis: query frequency (1e-07 to 1).]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties

User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine.

[Figure: state diagram. Node labels: Home Page, Advanced search, Directory, Results, Selected, URL, Web, Refining, End session. Edges are labeled with transition probabilities and drawn in three classes: > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some Web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

anchor-text distribution of the words in the queries

past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified into more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997   2001
1      Commerce, travel, employment, or economy    13.3   24.7
2      People, places, or things                   6.7    19.7
3      Non-English or unknown                      4.1    11.3
4      Computers or Internet                       12.5   9.6
5      Sex or pornography                          16.8   8.5
6      Health or sciences                          9.5    7.5
7      Entertainment or recreation                 19.9   6.6
8      Education or humanities                     5.6    4.5
9      Society, culture, ethnicity, or religion    5.4    3.9
10     Government                                  3.4    2.0
11     Performing or fine arts                     5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 0.45 on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the search results of a query with the classifier and then used a voting algorithm to classify the query.
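The classify-then-vote idea can be sketched as follows. The slides do not specify the voting scheme, so an unweighted majority vote is assumed here; the function name `classify_query_by_voting` and the category labels are hypothetical, and the per-result labels are taken as already produced by some document classifier.

```python
from collections import Counter


def classify_query_by_voting(result_categories, top_n=1):
    """Assign the query the categories that win the most votes among its results."""
    # Each search result contributes one vote for its predicted category.
    votes = Counter(result_categories)
    return [category for category, _ in votes.most_common(top_n)]


# Hypothetical classifier output for the top five results of one query.
labels = ["Shopping", "Living", "Shopping", "Shopping", "Computers"]
predicted = classify_query_by_voting(labels)
```

The appeal of this design is that a short query carries little text of its own, while its search results provide enough content for a document classifier to work with.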

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the fixed limit

the user might have more than one goal in the session

Hence, it is good to distinguish:

time-based sessions: queries of a user in the same session

missions: sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
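Splitting one user's queries into sessions with a maximum inactivity time can be sketched as below. The 30-minute default is an assumption chosen from within the five-to-sixty-minute range mentioned above, and the function name `split_sessions` and the timestamps are hypothetical.

```python
def split_sessions(timestamps, max_inactivity=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions."""
    sessions = []
    for t in sorted(timestamps):
        # Continue the current session only if the gap since the previous
        # query does not exceed the inactivity threshold.
        if sessions and t - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Hypothetical query times for a single user, in seconds.
times = [0, 60, 600, 4000, 4100, 9000]
sessions = split_sessions(times)
```

This only finds time-based sessions; grouping queries into missions would additionally require comparing the queries themselves.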

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
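The intuition behind Clarity Score can be illustrated by measuring the Kullback-Leibler divergence between a term model of the top retrieved documents and a term model of the whole collection. This is a simplified sketch under stated assumptions: both models are plain maximum likelihood estimates, the top documents are weighted equally, and the top documents are assumed to be part of the collection so that every term has a nonzero collection probability. The function name `clarity_score` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def clarity_score(top_docs, collection_docs):
    """KL divergence (in bits) between top-document and collection term models."""
    top_counts = Counter(t for doc in top_docs for t in doc)
    coll_counts = Counter(t for doc in collection_docs for t in doc)
    top_total = sum(top_counts.values())
    coll_total = sum(coll_counts.values())
    score = 0.0
    for term, count in top_counts.items():
        p_top = count / top_total
        # Assumption: top_docs are drawn from collection_docs, so p_coll > 0.
        p_coll = coll_counts[term] / coll_total
        score += p_top * math.log2(p_top / p_coll)
    return score


# Toy collection of tokenized documents.
collection = [
    ["jaguar", "car"],
    ["jaguar", "cat"],
    ["stock", "market"],
    ["weather", "forecast"],
]
focused = clarity_score(collection[:1], collection)
broad = clarity_score(collection, collection)
```

A result set that looks like the whole collection scores zero, while a topically cohesive result set scores higher, which matches the intuition stated above.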

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
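The final step of Query Feedback, scoring the overlap between the two rankings, can be sketched as below. The slides do not define the overlap measure, so the fraction of shared documents among the top k of each ranking is assumed here. Generating Q′ from L and running the second retrieval are not shown, and the function name `overlap_at_k` and the document identifiers are hypothetical.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Fraction of documents shared by the top-k of two rankings."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Hypothetical rankings: one for the original query, one for the derived query.
ranking = ["d1", "d2", "d3", "d4"]
ranking_prime = ["d2", "d1", "d7", "d9"]
score = overlap_at_k(ranking, ranking_prime, k=4)
```

A high overlap suggests the channel added little noise, so the query is predicted to perform well.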

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
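The SCS formula translates almost directly into code. The sketch below follows the definitions above: the maximum likelihood estimate is the term's count in the query divided by the query length, and the collection probability is the term's collection count divided by the total number of terms. The function name `simplified_clarity_score` and the collection counts are hypothetical, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_term_counts, collection_size):
    """Compute SCS(q) from query terms and collection term counts."""
    query_counts = Counter(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)  # P_ml(k_i | q)
        # Assumption: the term occurs in the collection, so p_coll > 0.
        p_coll = collection_term_counts[term] / collection_size  # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection statistics: 10,000 terms in total.
counts = {"text": 50, "mining": 10, "the": 5000}
scs_specific = simplified_clarity_score(["text", "mining"], counts, 10000)
scs_common = simplified_clarity_score(["the"], counts, 10000)
```

Queries made of rare terms receive a higher score than queries made of very common terms, since their term distribution is further from the collection's.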

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
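A small sketch of Averaged PMI follows. As an assumption made for this illustration, all three probabilities are estimated at the document level, as the fraction of documents containing the term or the pair. The function name `averaged_pmi` and the toy documents are hypothetical. The query needs at least two distinct terms, and every pair must co-occur in at least one document, since the logarithm is otherwise undefined.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, documents):
    """Average pointwise mutual information over all pairs of query terms."""
    docs = [set(doc) for doc in documents]
    n = len(docs)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(1 for d in docs if all(t in d for t in terms)) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    total = sum(
        math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs
    )
    return total / len(pairs)


# Toy collection of tokenized documents.
documents = [
    ["text", "mining"],
    ["text", "mining"],
    ["text", "search"],
    ["web", "search"],
]
av_pmi = averaged_pmi(["text", "mining"], documents)
```

A positive value means the query terms co-occur more often than they would if they were independent.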

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.


We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions

Early work used fixed limits of time to define sessionsbut this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence it is good to distinguish

time based sessions queries of a user in a same session

missions sequence of queries with the same goal

Research missions an additional problem is thatmissions can span more than one session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56

Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time

Different authors have found different thresholds ranging from fiveto sixty minutes

Query missions are a sequence of reformulatedqueries that express a same need

The detection of missions is a hard problem and hasbeen approached by several researchers

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57

Query PropertiesQuery Difficulty

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58

Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty

For example one word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval predictions mechanisms run the query andanalyze its answer set

Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59

Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand

Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents

The intuition is that the top ranked results of anunambiguous query will be topically cohesive

Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60

Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms

Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used

Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61

Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback

Weighted Information Gain measures the change ininformation about the quality of retrieval between

an imaginary state that only an average document is retrieved(estimated by the collection model) and

a posterior state where the actual search results are observed

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62

Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem

The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel

From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime

The overlap between L and Lprime is used as prediction score

Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63

Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty

For example they either take into account

the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or

the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)

Kwok et al proposed measures based on the inversedocument frequency and the collection frequency

Queries with low frequency terms are predicted toachieve a better performance

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
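The SCS formula can be computed directly from term counts. The sketch below follows the definitions given on the slide; the function and parameter names are hypothetical, and skipping query terms that never occur in the collection is an assumption made to avoid a division by zero.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, coll_term_counts, coll_total_terms):
    """SCS(q) = sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k)).

    P_ml(k|q) is the count of k in the query divided by the query length.
    P_coll(k) is the count of k in the collection divided by the total
    number of terms in the collection.
    """
    query_counts = Counter(query_terms)
    query_length = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / query_length
        p_coll = coll_term_counts.get(term, 0) / coll_total_terms
        if p_coll == 0:
            continue  # assumption: ignore terms unseen in the collection
        score += p_ml * math.log2(p_ml / p_coll)
    return score


counts = {"a": 250, "b": 250}
scs = simplified_clarity_score(["a", "b"], counts, coll_total_terms=1000)
```

In this toy collection each query term has P_ml = 0.5 and P_coll = 0.25, so each contributes 0.5 × log2(2) = 0.5 and the score is 1.0.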

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of pairs of query terms in the collection

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
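A small sketch of Averaged PMI over a toy collection follows. Two assumptions are made for illustration: all probabilities, including the single-term ones, are estimated as document fractions, and term pairs that never co-occur are skipped because their logarithm is undefined. The name `averaged_pmi` is hypothetical.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average log2(P(a, b) / (P(a) * P(b))) over pairs of distinct query terms.

    docs is a list of token lists. Probabilities are fractions of documents
    containing the term(s), an assumed estimator.
    """
    doc_sets = [set(doc) for doc in docs]
    num_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain every given term.
        return sum(1 for d in doc_sets if all(t in d for t in terms)) / num_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint == 0:
            continue  # assumption: skip pairs that never co-occur
        values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0


docs = [["a", "b"], ["a", "b"], ["c"], ["d"]]
pmi = averaged_pmi(["a", "b"], docs)
```

Here P(a) = P(b) = P(a, b) = 0.5, so the single pair yields log2(0.5 / 0.25) = 1.0.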

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy than post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections


Fixed Structure

An intermediate model, which lies between fixed structure and hypertext, is the hierarchical structure

An example of a hierarchical structure: the page of a book and its schematic view

[Figure: a book page (Chapter 6, with sections 6.1 Keyword Based and 6.3 Structured Queries) beside its schematic tree of chapter, section, title and figure nodes]

Fixed Structure

An example of a query to the hierarchical structure presented

[Figure: parsed query tree IN(figure, WITH(section, WITH(title, "structural")))]

This parsed query returns the image below

Query Languages

Query Protocols

Query Protocols

Sometimes query languages are used by applications to query text databases

Because they are not intended for human use, we refer to them as protocols rather than languages

The most important are/were

Z39.50

Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena there are several proposals for query protocols

The main goal of these protocols is to provide disc interchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture

The language does not define any specific formatting or markup

Query Protocols

For example, a query in SFQL is

Select abstract from journal.papers where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wildcards and repetition

For example: where paper contains "retrieval" or like "info%" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs

People have different search needs at different times and in different contexts

A number of researchers have attempted to classify and tally the types of information needs

Characterizing Web Queries

Most web search engines record information about the queries

This information includes the query itself, the time, and the IP address

Some systems also record which search results were clicked on for a given query

These logs are a valuable resource for understanding the kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (% of queries) for the last two references

Length    Dogpile (2005)    Yahoo (2007)
1         18                25
2         32                35
3         25                23
4         13                11
5         6                 4
6         3                 1
>7        3                 1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution

In fact, the frequency of query words follows a Zipf's law with parameter α

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences

However, this is less biased than Web text, where α is closer to 2

The standard correlation between the frequency of a word in Web pages and in queries also varies

These values range from 0.15 to 0.42
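One common way to estimate the Zipf parameter α from a query log is a least-squares fit of log frequency against log rank. This is only a sketch of that generic approach; the slides do not say how the reported values of α were obtained, and the function name `zipf_alpha` is hypothetical.

```python
import math


def zipf_alpha(frequencies):
    """Estimate alpha in f(r) proportional to r^(-alpha) by log-log regression.

    frequencies is an iterable of term frequencies (at least two positive
    values). Terms are ranked by decreasing frequency.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope  # the fitted line has slope -alpha


# Synthetic frequencies that follow f(r) = 1000 / r, i.e. alpha = 1.
alpha = zipf_alpha([1000 / r for r in range(1, 51)])
```

On real logs the fit is usually restricted to a range of ranks, since the head and the tail of the distribution tend to deviate from a straight line.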

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary (log-log plot; x-axis: document frequency, y-axis: query frequency)]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search

Many people refine the query by adding and removing words, but most of them see very few answer pages

The table below shows query statistics for four different search engines

Measure                   AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query           2.4                2.6             2.3                1.1
Queries per user          2.0                2.3             2.9                –
Answer pages per query    1.3                1.7             2.2                1.2
Boolean queries           <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query)

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics)

Query Properties

User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine, with node labels Home Page, Advanced search, Directory, Web, URL, Selected, Results, Refining and End session; edges carry transition probabilities, with the legend distinguishing x > 0.2, 0.2 >= x > 0.05 and 0.05 >= x > 0.01]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted at helping the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Query Intent

Broder's taxonomy of Web search goals

Navigational: The immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: The intent is to acquire some information (39% survey, 48% query log)

Transactional: The intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus they replace Broder's transactional category with a broader category of resources

Query Intent

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be classified into more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997   2001
1      Commerce, travel, employment or economy     13.3   24.7
2      People, places or things                    6.7    19.7
3      Non-English or unknown                      4.1    11.3
4      Computers or Internet                       12.5   9.6
5      Sex or pornography                          16.8   8.5
6      Health or sciences                          9.5    7.5
7      Entertainment or recreation                 19.9   6.6
8      Education or humanities                     5.6    4.5
9      Society, culture, ethnicity or religion     5.4    3.9
10     Government                                  3.4    2.0
11     Performing or fine arts                     5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: F-score of about 45% on 63 categories

The table below shows the results for five queries

Query                   Top category             Second category
chat rooms              Computers/Internet       Online Community/Chat
lake michigan lodges    Info/Local & Regional    Living/Travel & Vacation
stephen hawking         Info/Science & Tech      Info/Arts & Humanities
dog shampoo             Shopping/Buying Guides   Living/Pets & Animals
text mining             Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They classified the results of a query with the classifier and then used a voting algorithm to classify the query
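The voting step can be sketched as a simple majority vote over the categories assigned to the retrieved results. This is an illustrative simplification: the slides do not describe the actual voting scheme used by Broder et al., and the function name and the unweighted vote are assumptions.

```python
from collections import Counter


def classify_query_by_voting(result_categories, top_n=1):
    """Return the top_n categories that occur most often among the results.

    result_categories holds one category label per retrieved result, as
    produced by some document classifier (not shown here).
    """
    votes = Counter(result_categories)
    return [category for category, _ in votes.most_common(top_n)]


labels = ["Pets", "Shopping", "Pets", "Travel"]
predicted = classify_query_by_voting(labels)
```

A weighted variant could give higher ranked results more votes; that design choice is left open here.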

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed time limits to define sessions, but this definition presents two problems

the session might be longer than the limit

the user might have more than one goal in the session

Hence it is useful to distinguish

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are sequences of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
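Splitting one user's queries into sessions with a maximum inactivity time can be sketched as follows. The 30-minute default is an arbitrary value inside the five-to-sixty-minute range mentioned above, and the function name is hypothetical.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group query timestamps (in seconds) of one user into sessions.

    A new session starts whenever the gap to the previous query exceeds
    max_gap_seconds. The default threshold is an assumption.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Queries at 0 s and 60 s form one session; the next query comes after a
# gap of more than 30 minutes and opens a second session.
sessions = split_sessions([0, 60, 4000, 4100])
```

Note that this only finds time-based sessions; detecting missions would additionally require comparing the queries themselves.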

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top ranked results of an unambiguous query will be topically cohesive

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution
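The idea behind Clarity Score can be sketched as a Kullback-Leibler divergence between a language model of the top retrieved documents and the collection language model. The slides give no formula, so this is a simplified illustration: both models are plain maximum likelihood estimates, and all names are hypothetical.

```python
import math
from collections import Counter


def clarity_score(top_docs, collection_docs):
    """KL divergence (base 2) between the top-documents model and the
    collection model, both estimated by relative term frequency.

    top_docs and collection_docs are lists of token lists; top_docs is
    assumed to be a subset of the collection.
    """
    top_counts = Counter(t for doc in top_docs for t in doc)
    coll_counts = Counter(t for doc in collection_docs for t in doc)
    top_total = sum(top_counts.values())
    coll_total = sum(coll_counts.values())
    score = 0.0
    for term, count in top_counts.items():
        p_top = count / top_total
        p_coll = coll_counts[term] / coll_total
        if p_coll > 0:
            score += p_top * math.log2(p_top / p_coll)
    return score


collection = [["jaguar", "car"], ["jaguar", "cat"], ["tax", "law"], ["tax", "form"]]
focused = clarity_score(collection[2:], collection)  # topically cohesive results
diffuse = clarity_score(collection, collection)      # results mirror the collection
```

The cohesive result set gets a positive score, while a result set whose term distribution matches the collection gets zero, matching the intuition stated above.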

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large across all ranked lists, the query is deemed to be easy

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback



Fixed StructureAn example of a query to the hierarchical structurepresented

IN

figure WITH

WITHsection

title ldquostructuralrdquo

This parsed query returns the image below

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 30

Query LanguagesQuery Protocols

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 31

Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query Topic

Ambiguous queries are those that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

Few papers address ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit
  • the user might pursue more than one goal in the session

Hence it is useful to distinguish:

  • time-based sessions: the queries of a user within the same session
  • missions: a sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
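Splitting a log by maximum inactivity time can be sketched as below. The function name `split_sessions`, the (timestamp, query) log format, and the 30-minute default are assumptions made for illustration; the default simply falls inside the five-to-sixty-minute range mentioned above, and the log is assumed to belong to a single user.

```python
from datetime import datetime, timedelta


def split_sessions(log, max_inactivity=timedelta(minutes=30)):
    """Group (timestamp, query) pairs of one user into sessions.

    A new session starts whenever the gap to the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for timestamp, query in sorted(log):
        if sessions and timestamp - sessions[-1][-1][0] <= max_inactivity:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions


log = [
    (datetime(2005, 5, 6, 9, 0), "lake michigan"),
    (datetime(2005, 5, 6, 9, 10), "lake michigan lodges"),
    (datetime(2005, 5, 6, 10, 30), "dog shampoo"),
]
sessions = split_sessions(log)
print(len(sessions))
```

Note that this only finds time-based sessions. Detecting missions would additionally require judging whether consecutive queries express the same need.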

Query Properties: Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set
  • Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
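A minimal sketch of this idea measures the Kullback-Leibler divergence (in bits) between the two language models. It makes simplifying assumptions that are not in the slides: documents are lists of tokens, both models are unsmoothed maximum likelihood estimates, and all top documents are weighted equally. The names `language_model` and `clarity_score` are illustrative.

```python
import math
from collections import Counter


def language_model(documents):
    """Unsmoothed maximum likelihood unigram model over tokenized documents."""
    counts = Counter(term for doc in documents for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity_score(top_documents, collection):
    """KL divergence between the top-document model and the collection model.

    Assumes top_documents is a subset of collection, so every term seen in
    the top documents has a non-zero collection probability.
    """
    p_top = language_model(top_documents)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())


collection = [["jaguar", "car"], ["jaguar", "cat"],
              ["stock", "market"], ["weather", "today"]]
focused = clarity_score(collection[:2], collection)   # cohesive top results
diffuse = clarity_score(collection, collection)       # results look like the collection
print(focused, diffuse)
```

A higher score means the top results diverge more from the collection as a whole, which is read as a sign of an unambiguous (easier) query.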

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
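The overlap intuition can be sketched with a simple set-based measure. This is a stand-in for illustration only: it averages the Jaccard overlap of the top-k documents over all pairs of ranked lists, which is not necessarily the measure used by Aslam et al. The function name and the cutoff `k` are assumptions.

```python
from itertools import combinations


def mean_top_k_overlap(ranked_lists, k=10):
    """Average pairwise Jaccard overlap of the top-k documents of each ranking."""
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two ranked lists")

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three ranking functions that agree on the top documents: an "easy" query.
easy = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
# Three ranking functions with disjoint top documents: a "difficult" query.
hard = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
print(mean_top_k_overlap(easy, k=3), mean_top_k_overlap(hard, k=3))
```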

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and
  • a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q' is generated, and a second ranking L' is retrieved with Q'.

The overlap between L and L' is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or
  • the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
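As an illustration of a frequency-based predictor, Averaged IDF can be sketched as the mean of log2(N / n_i) over the query terms, where N is the number of documents and n_i the number of documents containing term k_i. The exact IDF variant and the decision to skip terms absent from the collection are assumptions of this sketch, as is the function name.

```python
import math


def averaged_idf(query_terms, documents):
    """Mean inverse document frequency of the query terms.

    documents is a list of term sets; terms that occur in no document
    are skipped (an assumption of this sketch).
    """
    n_docs = len(documents)
    idfs = []
    for term in query_terms:
        doc_freq = sum(1 for doc in documents if term in doc)
        if doc_freq:
            idfs.append(math.log2(n_docs / doc_freq))
    return sum(idfs) / len(idfs) if idfs else 0.0


documents = [
    {"the", "jaguar", "car"},
    {"the", "jaguar", "cat"},
    {"the", "stock"},
    {"the", "weather"},
]
print(averaged_idf(["stock"], documents))          # rare term, high score
print(averaged_idf(["jaguar", "the"], documents))  # diluted by a frequent term
```

A higher value means the query consists of rarer terms, which under the intuition above predicts better performance.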

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

\[
SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)
\]

where

  • $P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$
  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
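The SCS formula translates almost directly into code. In this sketch, P_ml(k_i|q) is taken to be the count of k_i in the query divided by the query length, and query terms that never occur in the collection are skipped to avoid division by zero; both choices, like the function name and the toy counts, are assumptions for illustration.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_counts):
    """SCS(q) = sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k))."""
    total_terms = sum(collection_counts.values())
    score = 0.0
    for term, count_in_query in Counter(query_terms).items():
        count_in_collection = collection_counts.get(term, 0)
        if count_in_collection == 0:
            continue  # assumption: ignore terms unseen in the collection
        p_ml = count_in_query / len(query_terms)
        p_coll = count_in_collection / total_terms
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Toy collection statistics: term -> number of occurrences in the collection.
collection_counts = {"the": 32, "car": 16, "jaguar": 16}
print(simplified_clarity_score(["jaguar"], collection_counts))
print(simplified_clarity_score(["the", "jaguar"], collection_counts))
```

In the toy data the single-term query "jaguar" scores higher than "the jaguar", because adding a very frequent term pulls the query model toward the collection model.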

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

\[
AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)
\]

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document, and $|(k_i, k_\ell)|$ is the number of term pairs in the query.
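A small sketch of Averaged PMI follows. It assumes that all probabilities are estimated as fractions of documents (so that the single-term and joint estimates are comparable) and that pairs which never co-occur are skipped, since their logarithm is undefined; neither choice is stated in the slides, and the function name is illustrative.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, documents):
    """Average pointwise mutual information over all pairs of distinct query terms."""
    n_docs = len(documents)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(1 for doc in documents if all(t in doc for t in terms)) / n_docs

    scores = []
    for k_i, k_l in combinations(sorted(set(query_terms)), 2):
        joint = prob(k_i, k_l)
        if joint > 0:  # assumption: skip pairs that never co-occur
            scores.append(math.log2(joint / (prob(k_i) * prob(k_l))))
    return sum(scores) / len(scores) if scores else 0.0


documents = [{"jaguar", "car"}, {"jaguar", "car"}, {"stock"}, {"weather"}]
print(averaged_pmi(["jaguar", "car"], documents))
```

Terms that co-occur more often than independence would predict give a positive score, suggesting a topically coherent query.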

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • the technique proposed by He et al. requires clustering
  • the technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

Consequently, neither technique is efficient enough to be useful for large collections.

  • Queries: Languages & Properties
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Fixed Structure
  • Query Protocols
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms

Query Languages: Query Protocols

Sometimes query languages are used by applications to query text databases.

Because they are not intended for human use, we refer to them as protocols rather than languages.

The most important are (or were):

  • Z39.50
  • Wide Area Information Service (WAIS)

Query Protocols

In the CD-ROM publishing arena, there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

  • Common Command Language (CCL)
  • Compact Disk Read only Data exchange (CD-RDx)
  • Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

    Select abstract from journal.papers
    where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wildcards and repetition.

For example:

    ... where paper contains "retrieval"
    or like "info%" and date > 1/1/98

Query Properties

Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most Web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

  • query occurrences
  • term occurrences
  • query length

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries from Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (% of queries) for the last two references.

Length   Dogpile (2005)   Yahoo (2007)
1        18               25
2        32               35
3        25               23
4        13               11
5         6                4
6         3                1
>7        3                1
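As a quick consistency check on the table, the mean query length can be approximated from the distribution. The sketch below counts the last bucket as exactly 7 terms, which is an assumption that makes the result a lower bound; the function name is illustrative.

```python
# Percentage of queries per length, copied from the table above.
# The open-ended last bucket is counted as 7 terms (an assumption).
dogpile_2005 = {1: 18, 2: 32, 3: 25, 4: 13, 5: 6, 6: 3, 7: 3}
yahoo_2007 = {1: 25, 2: 35, 3: 23, 4: 11, 5: 4, 6: 1, 7: 1}


def mean_query_length(distribution):
    """Weighted mean of query length, with percentages used as weights."""
    total_weight = sum(distribution.values())
    return sum(length * pct for length, pct in distribution.items()) / total_weight


print(round(mean_query_length(dogpile_2005), 2))
print(round(mean_query_length(yahoo_2007), 2))
```

For Dogpile the estimate comes out close to the mean length reported by Jansen et al., and the Yahoo distribution gives a somewhat shorter mean.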

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
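Under Zipf's law, the frequency of the word of rank r is proportional to 1 / r^α, so log-frequency is linear in log-rank with slope -α. A rough way to estimate α from a list of word frequencies is a least-squares fit on the log-log data, sketched below. This is a simple illustrative estimator (the function name is an assumption), not necessarily the method used in the studies cited above.

```python
import math


def estimate_zipf_alpha(frequencies):
    """Estimate alpha by fitting log(frequency) = c - alpha * log(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic frequencies that follow Zipf's law exactly with alpha = 1.
synthetic = [1000 / rank for rank in range(1, 51)]
alpha = estimate_zipf_alpha(synthetic)
print(round(alpha, 3))
```

On real query logs the fit is usually only approximate, and the estimate depends on which ranks are included.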

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary, on logarithmic axes. x-axis: Document frequency; y-axis: Query frequency.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista   Excite   AlltheWeb   TodoCL
                         (1998)      (2001)   (2001)      (2002)
Words per query          2.4         2.6      2.3         1.1
Queries per user         2.0         2.3      2.9         -
Answer pages per query   1.3         1.7      2.2         1.2
Boolean queries          <40%        10%      -           16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties: User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels: URL, Web, Home Page, Advanced search, Directory, Results, Selected, Refining, End session. Edges are labeled with transition probabilities.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)
  • Informational: the intent is to acquire some information (39% survey, 48% query log)
  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat modifies Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works apply machine learning to different query attributes, such as:

  • the anchor-text distribution of the words in the queries
  • past click behavior

The main problem is that many queries are inherently ambiguous.

They can be assigned to more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or
  • the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% over the same years.



Query ProtocolsSometimes query languages are used by applicationsto query text databases

Because they are not intended for human use we referto them as protocols rather than languages

The most important arewere

Z3950

Wide Area Information Service (WAIS)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 32

Query ProtocolsIn the CD-ROM publishing arena there are severalproposals for query protocols

The main goal of these protocols is to provide discinterchangeability

We can cite three of them

Common Command Language (CCL)

Compact Disk Read only Data exchange (CD-RDx)

Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-serverarchitecture

The language does not define any specific formatting ormarkup

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 33

Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

  • Informational: the intent is to acquire some information (39% survey, 48% query log)

  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • Anchor-text distribution of the words in the queries

  • Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or

  • the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank  Topic                                     1997   2001
1     Commerce, travel, employment or economy   13.3   24.7
2     People, places or things                   6.7   19.7
3     Non-English or unknown                     4.1   11.3
4     Computers or Internet                     12.5    9.6
5     Sex or pornography                        16.8    8.5
6     Health or sciences                         9.5    7.5
7     Entertainment or recreation               19.9    6.6
8     Education or humanities                    5.6    4.5
9     Society, culture, ethnicity or religion    5.4    3.9
10    Government                                 3.4    2.0
11    Performing or fine arts                    5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their result: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                 Top category            Second category
chat rooms            Computers/Internet      Online Community/Chat
lake michigan lodges  Info/Local & Regional   Living/Travel & Vacation
stephen hawking       Info/Science & Tech     Info/Arts & Humanities
dog shampoo           Shopping/Buying Guides  Living/Pets & Animals
text mining           Computers/Software      Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the results of a query through the classifier and then used a voting algorithm to classify the query.
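The voting step described for Broder et al. can be illustrated with a minimal sketch. It assumes the search results of a query have already been assigned category labels by some classifier (not shown here), and it uses a plain majority vote; the cited work may weight or combine votes differently. The function name and the labels are hypothetical.

```python
from collections import Counter

def classify_query(result_categories):
    """Return the category that receives the most votes.

    result_categories: one category label per retrieved result,
    as produced by a document classifier (assumed, not shown).
    """
    votes = Counter(result_categories)
    # most_common(1) returns [(label, count)] for the top label
    return votes.most_common(1)[0][0]

# Hypothetical labels for the results of the query "dog shampoo"
labels = ["Shopping", "Living", "Shopping", "Shopping", "Living"]
print(classify_query(labels))
```

The query itself is never classified directly: it inherits the label that dominates among its results, which is what makes the approach usable for short, rare queries.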

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit

  • the user might have more than one goal in the session

Hence it is good to distinguish:

  • time-based sessions: queries of a user in the same session

  • missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
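Splitting a log by maximum inactivity time can be sketched in a few lines. The 30-minute default below is only an assumed value inside the five-to-sixty-minute range mentioned above, and the function name and sample timestamps are illustrative.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, max_gap=timedelta(minutes=30)):
    """Group one user's query timestamps into sessions.

    A new session starts whenever the time since the previous
    query exceeds max_gap (an assumed threshold).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)   # still within the inactivity limit
        else:
            sessions.append([ts])     # gap too long: open a new session
    return sessions

queries = [
    datetime(2010, 1, 1, 9, 0),
    datetime(2010, 1, 1, 9, 10),
    datetime(2010, 1, 1, 11, 0),
    datetime(2010, 1, 1, 11, 5),
]
print([len(s) for s in split_sessions(queries)])
```

This only recovers time-based sessions. Detecting missions would additionally require comparing the queries themselves, since one session can mix several goals.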

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set

  • Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
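The Clarity Score intuition can be made concrete with a simplified sketch. It measures the difference between the two language models as a Kullback-Leibler divergence, and it estimates the model of the top documents by plain maximum likelihood with every document weighted equally; both choices are assumptions of this sketch rather than details given above. All names and the toy collection are illustrative.

```python
import math
from collections import Counter

def language_model(docs):
    """Maximum-likelihood unigram model over tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def clarity(top_docs, collection):
    """KL divergence (bits) of the top-document model from the collection model.

    Assumes top_docs is drawn from collection, so every term of the
    top documents has a non-zero collection probability.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[t]) for t, p in p_top.items())

collection = [
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "speed"],
    ["cat", "pet", "food"],
]
focused = clarity([collection[0], collection[2]], collection)  # results about cars only
diffuse = clarity(collection, collection)                      # results mirror the collection
print(focused > diffuse)
```

A result set that looks like the whole collection scores zero, which matches the assumption that ambiguous queries have a term distribution close to the collection distribution.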

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
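The overlap criterion can be sketched as the average pairwise overlap of the top-k documents across several ranked lists. Plain set overlap is used here as a stand-in; the cited works may compare lists with other measures. The function name, the cutoff k, and the document identifiers are hypothetical.

```python
from itertools import combinations

def mean_topk_overlap(ranked_lists, k=10):
    """Average pairwise overlap of the top-k documents.

    ranked_lists: at least two ranked lists of document ids, for
    example one per ranking function. Returns a value in [0, 1].
    """
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)

easy = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
hard = [["d1", "d2", "d3"], ["d4", "d5", "d1"], ["d6", "d7", "d8"]]
print(mean_topk_overlap(easy, k=3), mean_topk_overlap(hard, k=3))
```

A value near 1 means the ranking functions agree on the top documents, so the query would be predicted easy; a value near 0 signals a difficult query.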

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and

  • a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L a new query Q' is generated, and a second ranking L' is retrieved with Q'.

The overlap between L and L' is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
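The Query Feedback loop (Q to L, L to Q', Q' to L', then overlap) can be illustrated with a toy sketch. The retrieval function below simply counts query-term occurrences, and Q' is built from the most frequent terms of the documents in L; both are assumptions made for illustration and not the procedure of the cited authors. All names and documents are hypothetical.

```python
from collections import Counter

def search(query, docs, k):
    """Toy retrieval system: rank documents by query-term occurrences."""
    scores = {d: sum(tokens.count(t) for t in query) for d, tokens in docs.items()}
    hits = [d for d in scores if scores[d] > 0]
    # Higher score first; ties broken by document id for determinism
    return sorted(hits, key=lambda d: (-scores[d], d))[:k]

def query_feedback(query, docs, k=3, n_terms=2):
    """Overlap between the ranking of Q and the ranking of a query derived from it."""
    first = search(query, docs, k)                       # ranked list L
    if not first:
        return 0.0
    counts = Counter(t for d in first for t in docs[d])
    new_query = [t for t, _ in counts.most_common(n_terms)]  # Q' generated from L
    second = search(new_query, docs, k)                  # ranked list L'
    return len(set(first) & set(second)) / len(first)

docs = {
    "d1": ["jaguar", "car", "car", "car", "engine"],
    "d2": ["jaguar", "cat", "cat", "jungle"],
    "d3": ["car", "engine", "car"],
    "d4": ["cat", "jungle", "cat"],
    "d5": ["car", "car", "road"],
}
print(query_feedback(["car"], docs, k=2, n_terms=1))
print(query_feedback(["jaguar"], docs, k=2, n_terms=1))
```

In this toy collection the focused query keeps its ranking after the round trip, while the ambiguous one drifts toward a single sense and loses part of its original results.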

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

  • the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
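A frequency-based pre-retrieval predictor such as Averaged IDF can be sketched as follows. The exact IDF variant is not given above, so the add-one smoothed form used here is an assumption, as are the function name and the toy collection.

```python
import math

def averaged_idf(query, docs):
    """Mean inverse document frequency of the query terms.

    docs: list of tokenized documents. The smoothed formula
    log2((N + 1) / (df + 1)) is an assumed IDF variant.
    """
    n = len(docs)
    idfs = []
    for term in query:
        df = sum(1 for doc in docs if term in doc)  # document frequency
        idfs.append(math.log2((n + 1) / (df + 1)))
    return sum(idfs) / len(idfs)

docs = [
    ["the", "web", "search"],
    ["the", "query", "log"],
    ["the", "zipf", "law"],
    ["the", "web", "log"],
]
print(averaged_idf(["zipf"], docs) > averaged_idf(["web"], docs))
```

Only collection statistics are consulted, so the score is available before any retrieval is run. Higher values correspond to rarer terms, which are predicted to yield better performance.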

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

\[
\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)
\]

where

  • P_{ml}(k_i | q) is the maximum likelihood estimator of term k_i given query q

  • P_{coll}(k_i) is the term count of k_i in the collection divided by the total number of terms in the collection
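The SCS formula translates directly into code. The sketch below takes the maximum likelihood estimate of a term given the query to be its count in the query divided by the query length, and it assumes every query term occurs in the collection. The function name and the term counts are hypothetical.

```python
import math
from collections import Counter

def scs(query, collection_counts):
    """Simplified Clarity Score of a tokenized query, in bits.

    collection_counts maps each term to its count in the collection.
    Assumes every query term has a non-zero collection count.
    """
    total = sum(collection_counts.values())
    score = 0.0
    for term, count in Counter(query).items():
        p_ml = count / len(query)                 # P_ml(k_i | q)
        p_coll = collection_counts[term] / total  # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Hypothetical collection of 128 term occurrences
counts = Counter({"the": 64, "web": 32, "query": 16, "zipf": 16})
print(scs(["zipf"], counts))         # log2(1 / (16/128)) = 3.0
print(scs(["web", "zipf"], counts))  # 0.5 * 1 + 0.5 * 2 = 1.5
```

Queries made of rare terms diverge more from the collection model and receive higher scores.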

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

\[
\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)
\]

where P_{coll}(k_i, k_\ell) is the probability that terms k_i and k_\ell occur in the same document.
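A small sketch of Averaged PMI follows. For simplicity it estimates both the joint and the single-term probabilities as fractions of documents, which is an assumption of this sketch, and it requires every pair of query terms to co-occur in at least one document so that the logarithm is defined. Names and data are illustrative.

```python
import math
from itertools import combinations

def av_pmi(query, docs):
    """Average pointwise mutual information over all pairs of query terms.

    Probabilities are document-level fractions (an assumed estimator).
    Requires at least two distinct terms, each pair co-occurring somewhere.
    """
    doc_sets = [set(doc) for doc in docs]
    n = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms
        return sum(1 for d in doc_sets if all(t in d for t in terms)) / n

    pairs = list(combinations(sorted(set(query)), 2))
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)

docs = [
    ["text", "mining"],
    ["text", "mining"],
    ["text", "search"],
    ["web", "search"],
]
print(av_pmi(["text", "mining"], docs) > 0)  # the terms co-occur more than chance
print(av_pmi(["text", "search"], docs) < 0)  # the terms co-occur less than chance
```

Positive values indicate query terms that tend to appear together in the collection, which suggests a more coherent query.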

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering

  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections.


Query Protocols

In the CD-ROM publishing arena there are several proposals for query protocols.

The main goal of these protocols is to provide disc interchangeability.

We can cite three of them:

  • Common Command Language (CCL)

  • Compact Disk Read only Data exchange (CD-RDx)

  • Structured Full-text Query Language (SFQL)

SFQL is based on SQL and also has a client-server architecture.

The language does not define any specific formatting or markup.

Query Protocols

For example, a query in SFQL is:

    Select abstract from journal.papers
    where title contains "text search"

The language supports Boolean and logical operators, thesaurus, proximity operations, and some special characters such as wildcards and repetition.

For example:

    where paper contains "retrieval"
    or like "info%" and date > 1/1/98

Query Properties


Characterizing Web Queries

The notion of information seeking encompasses a broad range of information needs.

People have different search needs at different times and in different contexts.

A number of researchers have attempted to classify and tally the types of information needs.

Characterizing Web Queries

Most web search engines record information about the queries.

This information includes the query itself, the time, and the IP address.

Some systems also record which search results were clicked on for a given query.

These logs are a valuable resource for understanding the kinds of information needs that users have.

The first studies concentrated on basic statistics:

  • query occurrences

  • term occurrences

  • query length
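The three basic statistics can be computed from a query log in a few lines. The log below is a made-up list of query strings; real logs also carry timestamps and IP addresses, which this sketch ignores.

```python
from collections import Counter

# Hypothetical query log: one query string per entry
log = [
    "weather",
    "text mining",
    "weather",
    "cheap flights to rome",
    "text mining tools",
]

query_occurrences = Counter(log)
term_occurrences = Counter(term for query in log for term in query.split())
mean_length = sum(len(query.split()) for query in log) / len(log)

print(query_occurrences.most_common(1))  # most frequent query
print(term_occurrences.most_common(3))   # most frequent terms
print(mean_length)                       # average words per query
```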

Characterizing Web Queries

Jansen & Spink claim a trend of a decreasing percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean length of the queries was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these last two references.

Length  Dogpile (2005)  Yahoo (2007)
1       18              25
2       32              35
3       25              23
4       13              11
5        6               4
6        3               1
>7       3               1

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
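The Zipf parameter α can be estimated from term frequencies with a least-squares fit on the log-log rank-frequency plot. This is one simple estimator among several and is not necessarily the one used in the studies above; the function name and the synthetic data are illustrative.

```python
import math

def zipf_alpha(frequencies):
    """Estimate alpha in f(r) ~ C / r**alpha.

    Fits a straight line to (log rank, log frequency) by least
    squares and returns the negated slope.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic frequencies that follow Zipf's law exactly with alpha = 1.2
synthetic = [1000 / rank ** 1.2 for rank in range(1, 51)]
print(round(zipf_alpha(synthetic), 3))
```

On real query logs the tail of the distribution is noisy, so the estimate depends on how many ranks are included in the fit.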

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary, on logarithmic axes (x: document frequency, y: query frequency).]



Query ProtocolsFor example a query in SFQL is

Select abstract from journalpaperswhere title contains text search

The language supports boolean and logical operatorsthesaurus proximity operations and some specialcharacters as wild-cards and repetition

For example where paper contains retrievalor like info and date gt 1198

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 34

Query Properties

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 35

Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web Queries

Queries, as words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
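As a rough illustration of how the Zipf parameter α might be estimated from a query log, the sketch below fits a straight line to the rank-frequency curve in log-log space. It assumes the usual form of Zipf's law, in which the frequency of the r-th most frequent word is proportional to r^(-α). The least-squares fit and the name `estimate_zipf_alpha` are choices made here for illustration; the cited studies may have used other estimators.

```python
import math


def estimate_zipf_alpha(term_counts):
    """Estimate alpha by least squares on log(frequency) vs. log(rank).

    term_counts maps each word to its (positive) frequency.
    Needs at least two distinct words.
    """
    freqs = sorted(term_counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -alpha, so flip the sign.
    return -sxy / sxx


# Synthetic counts that follow Zipf's law with alpha = 1 by construction.
synthetic_counts = {f"w{r}": 1000.0 / r for r in range(1, 51)}
alpha = estimate_zipf_alpha(synthetic_counts)
```

On real logs the low-frequency tail is noisy, so a fit restricted to the most frequent words may be preferable to fitting all ranks.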

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary, on logarithmic axes. Horizontal axis: Document frequency. Vertical axis: Query frequency.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                  AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query          2.4                2.6             2.3                1.1
Queries per user         2.0                2.3             2.9                –
Answer pages per query   1.3                1.7             2.2                1.2
Boolean queries          <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. States: Home Page, Advanced search, Directory, Selected, Results, Refining, End session. Edges are labeled with transition probabilities, grouped in the legend as > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)
  • Informational: the intent is to acquire some information (39% survey, 48% query log)
  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replace Broder's transactions category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • anchor-text distribution of the words in the queries
  • past click behavior

The main problem is that many queries are inherently ambiguous.

They can be classified in more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or
  • the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                      1997   2001
1      Commerce, travel, employment or economy    13.3   24.7
2      People, places or things                   6.7    19.7
3      Non-English or unknown                     4.1    11.3
4      Computers or Internet                      12.5   9.6
5      Sex or pornography                         16.8   8.5
6      Health or sciences                         9.5    7.5
7      Entertainment or recreation                19.9   6.6
8      Education or humanities                    5.6    4.5
9      Society, culture, ethnicity or religion    5.4    3.9
10     Government                                 3.4    2.0
11     Performing or fine arts                    5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category              Second category
chat rooms             Computers/Internet        Online Community/Chat
lake michigan lodges   Info/Local & Regional     Living/Travel & Vacation
stephen hawking        Info/Science & Tech       Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides    Living/Pets & Animals
text mining            Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They classified the results of a query with the classifier and then used a voting algorithm to classify the query.
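The idea of classifying a query through its search results can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Broder et al.: the simple majority vote, the function `classify_query_by_voting`, and the keyword-based `toy_classifier` are all hypothetical stand-ins for a trained document classifier and the actual voting scheme.

```python
from collections import Counter


def classify_query_by_voting(result_texts, classify_document, top_k=1):
    """Classify each search result, then let the results vote for the query.

    classify_document is any function mapping a result text to a category.
    Returns the top_k categories with the most votes.
    """
    votes = Counter(classify_document(text) for text in result_texts)
    return [category for category, _ in votes.most_common(top_k)]


def toy_classifier(text):
    """Hypothetical stand-in for a classifier trained on a taxonomy."""
    keywords = {"shampoo": "Shopping", "puppy": "Pets", "vet": "Pets"}
    for word in text.lower().split():
        if word in keywords:
            return keywords[word]
    return "Unknown"


# Made-up result snippets for a short query.
results = ["Buy dog shampoo online", "Puppy care tips", "Find a vet near you"]
predicted = classify_query_by_voting(results, toy_classifier)
```

Using the results rather than the query text itself is what makes the approach workable for short, rare queries, since the results supply the vocabulary that the query lacks.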

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed limits of time to define sessions, but this definition presents two problems:

  • the session might be longer
  • the goals of the user in the session might be more than one

Hence, it is good to distinguish:

  • time-based sessions: queries of a user in the same session
  • missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
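Splitting a user's queries into sessions with a maximum inactivity time can be sketched as below. The 30-minute default is an assumption picked from within the five-to-sixty-minute range mentioned above, and the function name and the `(timestamp, query)` input format are hypothetical.

```python
def split_sessions(events, max_inactivity_seconds=30 * 60):
    """Group one user's queries into sessions by inactivity gaps.

    events is a list of (timestamp_in_seconds, query) pairs in any order.
    A new session starts whenever the gap since the previous query
    exceeds max_inactivity_seconds.
    """
    sessions = []
    last_time = None
    for timestamp, query in sorted(events):
        if last_time is None or timestamp - last_time > max_inactivity_seconds:
            sessions.append([])
        sessions[-1].append(query)
        last_time = timestamp
    return sessions


# Hypothetical events: two queries ten minutes apart, then one much later.
events = [(0, "hotel rome"), (600, "cheap hotel rome"), (5000, "weather rome")]
sessions = split_sessions(events)
```

This only produces time-based sessions. Detecting missions would additionally require comparing the queries themselves, since one session can contain several goals.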

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set
  • Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
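One way to make the Clarity Score intuition concrete is to measure the Kullback-Leibler divergence between a language model of the top retrieved documents and the collection language model. The sketch below is a simplification under stated assumptions: both models are plain maximum likelihood unigram models, the top documents are weighted equally, and the top documents are assumed to be part of the collection so that every term has nonzero collection probability. The original Clarity Score may estimate the query language model differently.

```python
import math
from collections import Counter


def language_model(docs):
    """Maximum likelihood unigram model over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity_score(top_docs, collection_docs):
    """KL divergence (in bits) of the top-documents model from the collection model."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection_docs)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())


# Tiny made-up collection of tokenized documents.
collection = [["jaguar", "car"], ["jaguar", "cat"], ["stock", "market"], ["rain", "forecast"]]
focused = clarity_score(collection[:2], collection)
unfocused = clarity_score(collection, collection)
```

A result set whose term distribution matches the collection scores zero, which corresponds to the ambiguous case described above. A topically cohesive result set scores higher.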

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
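The overlap criterion can be illustrated with a small sketch that averages the pairwise overlap of the top-k documents across several ranked lists. Using the Jaccard coefficient as the overlap measure is an assumption made here for simplicity, as are the function name and the default k. The cited works may quantify agreement between rankings in other ways.

```python
from itertools import combinations


def mean_topk_overlap(ranked_lists, k=10):
    """Average Jaccard overlap of the top-k documents over all pairs of lists.

    Values near 1 suggest an easy query; values near 0 suggest a hard one.
    """
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    if not pairs:
        return 1.0  # a single list trivially agrees with itself
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return total / len(pairs)


# Hypothetical rankings of document ids from three ranking functions.
rankings = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d2", "d4"]]
overlap = mean_topk_overlap(rankings, k=3)
```

The same function could be applied to two lists only, for instance to compare the ranking of a full query with the ranking obtained from a subset of its terms.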

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and
  • a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they either take into account:

  • the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or
  • the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve a better performance.
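A frequency-based pre-retrieval predictor such as Averaged IDF can be sketched as follows. The IDF variant used here, log2(N / df), is one common choice and is an assumption; the function name and the decision to ignore terms that do not occur in the collection are also illustrative.

```python
import math


def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean inverse document frequency of the query terms.

    doc_freq maps a term to the number of documents containing it.
    Terms absent from the collection are ignored (an assumption).
    """
    idfs = [
        math.log2(num_docs / doc_freq[term])
        for term in query_terms
        if doc_freq.get(term, 0) > 0
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0


# Hypothetical document frequencies in a collection of 1024 documents.
doc_freq = {"the": 1024, "zipf": 4}
score = averaged_idf(["the", "zipf"], doc_freq, num_docs=1024)
```

A higher score means the query is made of rarer terms, which under the heuristic stated above predicts better retrieval performance.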

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

\[ SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right) \]

where

  • P_{ml}(k_i|q) is the maximum likelihood estimator of term k_i given query q
  • P_{coll}(k_i) is the term count of k_i in the collection divided by the total number of terms in the collection
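The SCS formula translates almost directly into code. In the sketch below, P_ml(k_i|q) is taken to be the relative frequency of the term within the query, which is the usual maximum likelihood estimate. The function and parameter names are hypothetical, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_term_counts, collection_size):
    """Compute SCS(q) = sum over query terms of P_ml * log2(P_ml / P_coll).

    collection_term_counts maps a term to its total count in the collection;
    collection_size is the total number of term occurrences in the collection.
    """
    query_counts = Counter(query_terms)
    query_length = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / query_length
        p_coll = collection_term_counts[term] / collection_size
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical counts: "text" occurs 4 times and "mining" once among 16 terms.
scs = simplified_clarity_score(["text", "mining"], {"text": 4, "mining": 1}, 16)
```

In this example each term has P_ml = 0.5, so the score is 0.5 × log2(0.5 / 0.25) + 0.5 × log2(0.5 / 0.0625) = 0.5 + 1.5 = 2.0.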

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

\[ AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) P_{coll}(k_\ell)} \right) \]

where P_{coll}(k_i, k_\ell) is the probability that terms k_i and k_\ell occur in the same document.
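A small sketch of Averaged PMI follows. Two simplifications are assumed here and are not taken from the source: all probabilities, including the single-term ones, are estimated as fractions of documents, and term pairs that never co-occur are skipped because their PMI is undefined. The function name is hypothetical.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over pairs of distinct query terms.

    docs is a list of tokenized documents. Probabilities are document-level
    fractions (an assumption). Pairs with zero co-occurrence are skipped.
    """
    doc_sets = [set(doc) for doc in docs]
    num_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / num_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:
            values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0


# Tiny made-up collection in which "query" and "log" always co-occur.
docs = [["query", "log"], ["query", "log"], ["zipf"], ["session"]]
pmi = averaged_pmi(["query", "log"], docs)
```

Here both terms appear in half of the documents and always together, so the PMI is log2(0.5 / (0.5 × 0.5)) = 1.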

Pre-retrieval Algorithms

Pre-retrieval predictors usually have a lower accuracy compared to post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering
  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections.




Characterizing Web QueriesThe notion of information seeking encompasses abroad range of information needs

People have different search needs at different timesand in different contexts

A number of researchers have attempted to classify andtally the types of information needs

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 36

Characterizing Web QueriesMost web search engines record information aboutthe queries

This information includes the query itself the time and the IPaddress

Some systems also record which search results were clickedon for a given query

These logs are a valuable resource for understandingthe kinds of information needs that users have

The first studies concentrated on basic statistics

query occurrences

term occurrences

query length

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 37

Characterizing Web Queries

Jansen & Spink claim a decreasing trend in the percentage of one-word queries.

Jansen et al. found that the percentage of three-word queries increased from 28% in 1998 to 49% in 2002.

In May 2005, Jansen et al. conducted a study using 1.5M queries gathered from the Dogpile search engine.

This study found that the mean query length was 2.8 terms, with the longest query having 25 terms.

Results for a larger data set, 185M queries of Yahoo UK in 2007, were presented by Skobeltsyn et al.

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for these last two references.

Length    Dogpile (2005)    Yahoo (2007)
1         18                25
2         32                35
3         25                23
4         13                11
5         6                 4
6         3                 1
>7        3                 1

Characterizing Web Queries

Query words, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and in queries also varies.

These values range from 0.15 to 0.42.
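One simple way to estimate the Zipf parameter α from a list of term frequencies is a least-squares line fit in log-log space, sketched below. The function name `zipf_exponent` and the synthetic data are assumptions for illustration. The fit is a rough estimator and not necessarily the one used in the studies above.

```python
import math


def zipf_exponent(frequencies):
    """Estimate alpha in f(r) ~ C / r**alpha by least squares on log-log data."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # the fitted slope is -alpha


# Synthetic frequencies that follow the law exactly with alpha = 1.
synthetic = [1000 / rank for rank in range(1, 51)]
alpha = zipf_exponent(synthetic)
```

On real query logs the low-frequency tail is noisy, so the estimate depends on how many ranks are included in the fit.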

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary. Logarithmic axes; x-axis: document frequency, y-axis: query frequency.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                   AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query           2.4                2.6             2.3                1.1
Queries per user          2.0                2.3             2.9                –
Answer pages per query    1.3                1.7             2.2                1.2
Boolean queries           <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later, in the section about Query Topics).

Query Properties

User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels recovered from the figure: Home Page, Advanced search, Directory, Selected, Results, Refining, End session, URL, Web. Edges are labeled with transition probabilities; the legend groups them as > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always aimed at helping the user write good queries.

This implied that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

  • Informational: the intent is to acquire some information (39% survey, 48% query log)

  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replaced Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • Anchor-text distribution of the words in the queries

  • Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be assigned to more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or

  • the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                       1997    2001
1      Commerce, travel, employment, or economy    13.3    24.7
2      People, places, or things                   6.7     19.7
3      Non-English or unknown                      4.1     11.3
4      Computers or Internet                       12.5    9.6
5      Sex or pornography                          16.8    8.5
6      Health or sciences                          9.5     7.5
7      Entertainment or recreation                 19.9    6.6
8      Education or humanities                     5.6     4.5
9      Society, culture, ethnicity, or religion    5.4     3.9
10     Government                                  3.4     2.0
11     Performing or fine arts                     5.4     1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category              Second category
chat rooms             Computers/Internet        Online Community/Chat
lake michigan lodges   Info/Local & Regional     Living/Travel & Vacation
stephen hawking        Info/Science & Tech       Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides    Living/Pets & Animals
text mining            Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries account for approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the results of a query through the classifier and then used a voting algorithm to classify the query.
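The "classify the results, then vote" idea can be illustrated with a plain majority vote. The slides do not say which voting scheme was used, so the unweighted vote below, the function name `classify_query_by_voting`, and the example categories are all assumptions.

```python
from collections import Counter


def classify_query_by_voting(result_categories):
    """Return the category assigned most often to the query's results.

    `result_categories` holds one category label per retrieved result,
    as produced by some document classifier (not shown here).
    """
    if not result_categories:
        return None  # no results, so no vote is possible
    votes = Counter(result_categories)
    winner, _count = votes.most_common(1)[0]
    return winner


# Hypothetical labels for the top results of the query "dog shampoo".
labels = ["Shopping", "Living", "Shopping", "Health"]
predicted = classify_query_by_voting(labels)
```

A weighted vote (for instance, by result rank or classifier confidence) would fit the same interface.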

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit

  • the user might have more than one goal in the session

Hence, it is useful to distinguish:

  • time-based sessions: queries of a user in the same session

  • missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
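Splitting one user's queries into sessions with a maximum inactivity time can be sketched as follows. The 30-minute default is an assumption picked from inside the five-to-sixty-minute range mentioned above, and `split_sessions` is a hypothetical helper name.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous query
    exceeds `max_gap_seconds`.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(t)  # still within the inactivity limit
        else:
            sessions.append([t])  # gap too long: open a new session
    return sessions


# Hypothetical query times for a single user, in seconds.
times = [0, 60, 300, 4000, 4100, 20000]
sessions = split_sessions(times)
```

This only finds time-based sessions. Detecting missions would additionally require comparing the queries themselves.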

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set

  • Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
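A minimal sketch of the Clarity Score idea measures the Kullback-Leibler divergence (in bits) between a language model built from the top retrieved documents and the collection model. The simplification here is an assumption: the top documents are pooled with equal weight and no smoothing, whereas the original formulation weights documents by their query likelihood. It also assumes the top documents are drawn from the collection, so every term has non-zero collection probability.

```python
import math
from collections import Counter


def language_model(docs):
    """Unsmoothed unigram model: relative term frequency over pooled docs."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def clarity_score(top_docs, collection):
    """KL divergence (bits) between the top-docs model and the collection model."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[t]) for t, p in p_top.items())


# Hypothetical tokenized collection, for illustration only.
collection = [
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "speed"],
    ["cat", "pet", "food"],
]
focused = clarity_score([collection[0], collection[2]], collection)
unfocused = clarity_score(collection, collection)
```

A topically cohesive result set (the two car documents) gets a higher score than a result set that mirrors the whole collection, which scores zero.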

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
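The overlap intuition can be sketched as the average pairwise overlap of the top-k documents returned by several ranking functions. Plain set overlap is used here as a stand-in; the diversity measure actually used in the cited work may differ. The function name, the cutoff `k`, and the document ids are assumptions.

```python
from itertools import combinations


def mean_topk_overlap(rankings, k=10):
    """Average pairwise overlap of the top-k documents, in [0, 1]."""
    if len(rankings) < 2:
        raise ValueError("need at least two rankings")
    overlaps = []
    for a, b in combinations(rankings, 2):
        top_a, top_b = set(a[:k]), set(b[:k])
        overlaps.append(len(top_a & top_b) / k)
    return sum(overlaps) / len(overlaps)


# Hypothetical ranked lists from three ranking functions.
easy_query = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
hard_query = [["d1", "d2", "d3"], ["d4", "d5", "d1"], ["d6", "d7", "d8"]]
easy_overlap = mean_topk_overlap(easy_query, k=3)
hard_overlap = mean_topk_overlap(hard_query, k=3)
```

A high value suggests an easy query and a low value a difficult one; where to draw the line would have to be tuned on data.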

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and

  • a posterior state in which the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
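The Query Feedback loop (Q to L, L to Q′, Q′ to L′, then overlap) can be sketched with a toy retrieval function. Everything concrete here is an assumption made for illustration: the term-count ranker, building Q′ from the most frequent terms of the documents in L, and the parameters `k` and `n_terms`.

```python
from collections import Counter


def rank(query_terms, docs, k):
    """Toy retrieval: rank doc ids by occurrences of the query terms."""
    scores = {
        doc_id: sum(terms.count(t) for t in set(query_terms))
        for doc_id, terms in docs.items()
    }
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return [d for d in ordered if scores[d] > 0][:k]


def query_feedback(query_terms, docs, k=3, n_terms=2):
    """Overlap between the ranking for Q and the ranking for a derived Q'."""
    first = rank(query_terms, docs, k)  # L
    counts = Counter(t for d in first for t in docs[d])
    new_query = [t for t, _ in counts.most_common(n_terms)]  # Q'
    second = rank(new_query, docs, k)  # L'
    return len(set(first) & set(second)) / k


# Hypothetical tokenized documents.
docs = {
    "d1": ["python", "snake", "reptile"],
    "d2": ["python", "code", "language"],
    "d3": ["python", "code", "library"],
    "d4": ["coffee", "java", "island"],
}
qf_score = query_feedback(["python", "code"], docs)
```

Here the derived query reproduces the original ranking, so the overlap is maximal, which the method reads as a sign of an easy query.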

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

  • the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
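Averaged IDF, one of the frequency-based predictors named above, can be sketched as below. The IDF variant log2(N / df) is one common choice and is an assumption here, as are the function name, the decision to skip terms that never occur, and the toy statistics.

```python
import math


def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean inverse document frequency of the query terms (log base 2)."""
    idfs = [
        math.log2(num_docs / doc_freq[t])
        for t in query_terms
        if doc_freq.get(t, 0) > 0  # assumption: skip terms absent from the collection
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0


# Hypothetical document frequencies in a collection of 1000 documents.
doc_freq = {"the": 1000, "zipf": 125, "law": 250}
specific = averaged_idf(["zipf", "law"], doc_freq, num_docs=1000)
generic = averaged_idf(["the", "law"], doc_freq, num_docs=1000)
```

The query made of rarer terms gets the higher score, matching the prediction that queries with low-frequency terms perform better.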

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

  • $P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
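The SCS formula translates almost directly into code. In this sketch the maximum likelihood estimate is taken as the term's count in the query divided by the query length. The function name and the toy collection counts are assumptions, and every query term is assumed to occur in the collection (otherwise the ratio is undefined).

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, coll_term_counts, coll_total_terms):
    """SCS(q) = sum over query terms of P_ml * log2(P_ml / P_coll)."""
    q_counts = Counter(query_terms)
    q_len = len(query_terms)
    score = 0.0
    for term, count in q_counts.items():
        p_ml = count / q_len  # maximum likelihood estimate within the query
        p_coll = coll_term_counts[term] / coll_total_terms
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection statistics: 1024 term occurrences in total.
coll_counts = {"query": 16, "difficulty": 4}
scs = simplified_clarity_score(["query", "difficulty"], coll_counts, 1024)
```

With these numbers each term has P_ml = 0.5, and the two log ratios are log2(32) = 5 and log2(128) = 7, so the score is 0.5 × 5 + 0.5 × 7 = 6.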

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
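A sketch of Averaged PMI over a small tokenized collection follows. Two choices are assumptions: the single-term probabilities are estimated at the document level (fraction of documents containing the term) so that they match the joint probability, and pairs that never co-occur are skipped because their logarithm is undefined.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information (bits) over query term pairs."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / n

    pmis = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:  # assumption: skip pairs that never co-occur
            pmis.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(pmis) / len(pmis) if pmis else 0.0


# Hypothetical collections: terms that always co-occur vs. independent terms.
related_docs = [["text", "mining"], ["news"], ["news"], ["news"]]
independent_docs = [["x", "y"], ["x"], ["y"], []]
related = averaged_pmi(["text", "mining"], related_docs)
independent = averaged_pmi(["x", "y"], independent_docs)
```

Terms that appear together more often than chance score above zero, while statistically independent terms score zero.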

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much sparser.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering

  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections.




Characterizing Web QueriesJansen amp Spink claim a trend of a decreasingpercentage of one-word queries

Jansen et al found that the percentage of three wordqueries increased from 28 in 1998 to 49 in 2002

In May 2005 Jansen et al conducted a study using15M queries gathered from the Dogpile search engine

This study found that the mean length of the queries was 28terms with the longest query having 25 terms

Results for a larger data set 185M queries of YahooUK in 2007 were presented by Skobeltsyn et al

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 38

Characterizing Web Queries

The table below shows the distribution of query lengths (percentage of queries) for the last two studies.

Length    Dogpile (2005)    Yahoo (2007)
1         18%               25%
2         32%               35%
3         25%               23%
4         13%               11%
5         6%                4%
6         3%                1%
>7        3%                1%

Characterizing Web Queries

Queries, like words in a text, follow a biased distribution.

In fact, the frequency of query words follows a Zipf's law with parameter α.

The value of α ranges from 0.6 to 1.4, perhaps due to language and cultural differences.

However, this is less biased than Web text, where α is closer to 2.

The standard correlation between the frequency of a word in Web pages and its frequency in queries also varies.

These values range from 0.15 to 0.42.
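As a rough illustration of how the Zipf parameter α could be estimated from a query log, the sketch below fits a straight line to log(frequency) versus log(rank) and returns the negated slope. The function name `estimate_zipf_alpha` and the synthetic frequencies are assumptions made for this example; a least-squares fit is only one of several possible estimators and is not necessarily the method used in the studies cited above.

```python
import math


def estimate_zipf_alpha(frequencies):
    """Estimate the Zipf exponent as the negated least-squares slope
    of log(frequency) against log(rank)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


# Synthetic word frequencies that follow an exact power law with alpha = 1.2
# (hypothetical data, not taken from any real query log).
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 201)]
alpha = estimate_zipf_alpha(synthetic)
```

On real logs the fit is usually noisy at the tail of rare words, so the estimate depends on which rank range is included.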

Characterizing Web Queries

This implies that what people search for is different from what people publish on the Web.

[Figure: relative query frequency vs. relative document frequency of each word in a vocabulary. Logarithmic axes; x-axis: document frequency, y-axis: query frequency.]

Characterizing Web Queries

Some logs also register the number of answer pages seen and the pages selected after a search.

Many people refine the query by adding and removing words, but most of them see very few answer pages.

The table below shows query statistics for four different search engines.

Measure                   AltaVista (1998)   Excite (2001)   AlltheWeb (2001)   TodoCL (2002)
Words per query           2.4                2.6             2.3                1.1
Queries per user          2.0                2.3             2.9                –
Answer pages per query    1.3                1.7             2.2                1.2
Boolean queries           <40%               10%             –                  16%

Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later in the section about Query Topics).

Query Properties

User Search Behavior

User Search Behavior

State diagram of user behavior in a small search engine

[Figure: state diagram. Labels in the diagram: Home Page, Advanced search, Directory, Web, URL, Results, Selected, Refining, End session. Edges carry transition probabilities; the legend distinguishes probabilities > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)
  • Informational: the intent is to acquire some information (39% survey, 48% query log)
  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus they replace Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • Anchor-text distribution of the words in the queries
  • Past click behavior

The main problem is that many queries are inherently ambiguous.

They can be assigned to more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

  • the simple information need of looking at today's forecast, or
  • the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank   Topic                                        1997    2001
1      Commerce, travel, employment, or economy     13.3    24.7
2      People, places, or things                    6.7     19.7
3      Non-English or unknown                       4.1     11.3
4      Computers or Internet                        12.5    9.6
5      Sex or pornography                           16.8    8.5
6      Health or sciences                           9.5     7.5
7      Entertainment or recreation                  19.9    6.6
8      Education or humanities                      5.6     4.5
9      Society, culture, ethnicity, or religion     5.4     3.9
10     Government                                   3.4     2.0
11     Performing or fine arts                      5.4     1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category              Second category
chat rooms             Computers/Internet        Online Community/Chat
lake michigan lodges   Info/Local & Regional     Living/Travel & Vacation
stephen hawking        Info/Science & Tech       Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides    Living/Pets & Animals
text mining            Computers/Software        Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

This is an important problem because rare or infrequent queries make up approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the search results of a query through the classifier and then used a voting algorithm to classify the query.
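The idea of classifying a query through its search results can be sketched as follows. This is a minimal illustration, not the method of Broder et al.: the voting scheme here is a plain majority vote, and `toy_document_classifier`, its keyword lists, and the example result snippets are all hypothetical stand-ins for a trained document classifier and a real result list.

```python
from collections import Counter


def classify_query_by_voting(result_texts, classify_document, top_k=1):
    """Classify each retrieved result, then let the results vote on the
    category of the query (simple majority vote, an assumption)."""
    votes = Counter(classify_document(text) for text in result_texts)
    return [category for category, _ in votes.most_common(top_k)]


def toy_document_classifier(text):
    """Hypothetical keyword-based stand-in for a trained classifier."""
    keywords = {
        "Autos": {"car", "engine", "dealer"},
        "Animals": {"cat", "habitat", "wildlife"},
    }
    words = set(text.lower().split())
    scores = {category: len(words & kw) for category, kw in keywords.items()}
    return max(scores, key=scores.get)


# Hypothetical search results for the short query "jaguar".
results = [
    "jaguar car dealer prices",
    "jaguar engine review",
    "jaguar wildlife habitat",
]
predicted = classify_query_by_voting(results, toy_document_classifier)
```

The query text itself is never classified, which is what makes the approach usable for short and rare queries that carry little evidence on their own.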

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer
  • the user might have more than one goal in the session

Hence, it is useful to distinguish:

  • time-based sessions: queries of a user in the same session
  • missions: a sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
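Splitting a user's query stream with a maximum inactivity time can be sketched in a few lines. The function name `split_sessions` and the 30-minute default are assumptions chosen for illustration; the default simply falls inside the five-to-sixty-minute range mentioned above.

```python
def split_sessions(timestamps, max_inactivity_seconds=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions.
    A new session starts whenever the gap to the previous query
    exceeds the inactivity threshold."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_inactivity_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical timestamps: three queries close together, then two more
# after a pause of more than an hour.
example = split_sessions([0, 60, 120, 4000, 4100])
```

Note that this only recovers time-based sessions; grouping queries into missions would additionally require comparing what the queries are about.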

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set
  • Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
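The intuition behind Clarity Score can be illustrated with a small sketch that measures the Kullback-Leibler divergence between a unigram model of the top retrieved documents and a unigram model of the whole collection. This is a simplification under stated assumptions: the top-document model is an unweighted maximum likelihood estimate, no smoothing is applied, and the top documents are assumed to be part of the collection so that every term has a non-zero collection probability. The function names and the tiny collection are made up for the example.

```python
import math
from collections import Counter


def unigram_model(documents):
    """Maximum likelihood unigram model over a list of token lists."""
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity_score(top_documents, collection):
    """KL divergence (in bits) between the top-document model and the
    collection model. Assumes top_documents is a subset of collection."""
    top_model = unigram_model(top_documents)
    coll_model = unigram_model(collection)
    return sum(p * math.log2(p / coll_model[term])
               for term, p in top_model.items())


# Hypothetical four-document collection.
collection = [
    ["jaguar", "car"],
    ["jaguar", "cat"],
    ["stock", "market"],
    ["stock", "price"],
]
focused = clarity_score(collection[2:], collection)   # cohesive top results
unfocused = clarity_score(collection, collection)     # results mirror the collection
```

A cohesive result list yields a positive score, while a result list whose term distribution matches the collection yields zero, which matches the intuition described above.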

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
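The notion of agreement between ranked lists can be sketched as the mean pairwise overlap of their top-k documents. Plain set overlap is used here as an assumption for illustration; it stands in for whatever agreement or divergence measure the cited work actually uses, and the function name and document identifiers are hypothetical.

```python
from itertools import combinations


def mean_topk_overlap(ranked_lists, k=10):
    """Average fraction of shared documents among the top-k results of
    every pair of ranked lists. Values near 1 suggest an easy query,
    values near 0 a difficult one."""
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two ranked lists")
    return sum(len(a & b) / k for a, b in pairs) / len(pairs)


# Three hypothetical ranking functions that agree on the top documents.
agreeing = mean_topk_overlap(
    [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d3", "d1", "d2"]], k=3)

# Two hypothetical ranking functions with disjoint top documents.
disagreeing = mean_topk_overlap([["d1", "d2"], ["d3", "d4"]], k=2)
```

The same helper could compare the list of the full query against the lists of its sub-queries, in the spirit of the first approach on this slide.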

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and
  • a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query performance prediction as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
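The Query Feedback loop described on this slide can be sketched end to end with a toy retrieval function. Everything concrete here is an assumption for illustration: `toy_search` ranks by raw term matches, the new query Q′ is built from the most frequent terms of the documents in L (the source does not say how Q′ is generated), and the documents are invented.

```python
from collections import Counter


def toy_search(query_terms, documents):
    """Hypothetical retrieval system: rank documents by the number of
    query term occurrences, dropping documents with no match."""
    scores = {doc_id: sum(tokens.count(t) for t in query_terms)
              for doc_id, tokens in documents.items()}
    matching = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(matching, key=lambda doc_id: (-scores[doc_id], doc_id))


def query_feedback_score(query_terms, documents, search=toy_search,
                         k=2, num_terms=2):
    """Overlap between the ranking L of the original query and the
    ranking L' of a query Q' generated from L."""
    first = search(query_terms, documents)[:k]                      # L
    term_counts = Counter(t for doc_id in first for t in documents[doc_id])
    new_query = [t for t, _ in term_counts.most_common(num_terms)]  # Q'
    second = search(new_query, documents)[:k]                       # L'
    return len(set(first) & set(second)) / k


docs = {
    "d1": ["solar", "panel", "energy"],
    "d2": ["solar", "energy", "cost"],
    "d3": ["football", "match", "score"],
}
score = query_feedback_score(["solar"], docs)
```

A high overlap means the ranking survives the round trip through the channel, which is read as a sign that the original query performs well.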

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or
  • the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
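A frequency-based predictor such as Averaged IDF needs nothing more than document frequencies. The sketch below assumes the common definition idf = log2(N / df) and, as a further assumption, ignores query terms that do not occur in the collection; the function name and the toy collection are hypothetical.

```python
import math


def averaged_idf(query_terms, documents):
    """Mean inverse document frequency of the query terms.
    Terms that never occur in the collection are ignored (an assumption)."""
    n_docs = len(documents)
    idfs = []
    for term in query_terms:
        df = sum(1 for doc in documents if term in doc)
        if df > 0:
            idfs.append(math.log2(n_docs / df))
    return sum(idfs) / len(idfs) if idfs else 0.0


# Hypothetical collection of four documents, each a set of terms.
docs = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"e"}]
rare_query = averaged_idf(["d", "e"], docs)    # both terms occur in one document
common_query = averaged_idf(["a"], docs)       # term occurs in three documents
```

Consistent with the last statement on the slide, the query made of rare terms receives the higher score.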

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

\[
\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)
\]

where

  • $P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$
  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
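The SCS formula translates almost directly into code. In this sketch the maximum likelihood estimate of a term given the query is taken to be its count in the query divided by the query length; the function name, the collection counts, and the collection size are hypothetical, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_counts, collection_size):
    """SCS(q): sum over distinct query terms of
    P_ml(k|q) * log2(P_ml(k|q) / P_coll(k))."""
    query_counts = Counter(query_terms)
    query_length = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / query_length                        # P_ml(k_i | q)
        p_coll = collection_counts[term] / collection_size  # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection statistics: 1024 term occurrences in total.
counts = {"text": 64, "mining": 32}
scs = simplified_clarity_score(["text", "mining"], counts, 1024)
```

With these numbers each term contributes 0.5 × log2(0.5 / P_coll), that is 0.5 × 3 for "text" and 0.5 × 4 for "mining", giving 3.5 in total.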

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of pairs of query terms in the collection:

\[
\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)
\]

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
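A small sketch of Averaged PMI follows. Two choices in it are assumptions rather than part of the definition above: the single-term probabilities are estimated document-wise (fraction of documents containing the term) so that they are comparable with the co-occurrence probability, and term pairs that never co-occur are skipped because their PMI is undefined. The function name and the toy collection are hypothetical.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, documents):
    """Average pointwise mutual information over all pairs of distinct
    query terms, with probabilities estimated from document counts."""
    n_docs = len(documents)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(1 for doc in documents
                   if all(t in doc for t in terms)) / n_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint == 0:
            continue  # pair never co-occurs; skipped (an assumption)
        values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0


# Hypothetical collection: "a" and "b" always appear together.
docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
av_pmi = averaged_pmi(["a", "b"], docs)
```

Here P(a) = P(b) = P(a, b) = 0.5, so the single pair contributes log2(0.5 / 0.25) = 1.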

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering
  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

However, neither technique is efficient enough to be useful for large collections.

  • Queries Languages amp Properties
  • Query Languages
  • Query Languages
  • Query Languages
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Word Queries
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Pattern Matching
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Structural Queries
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Query Protocols
  • Query Protocols
  • Query Protocols
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms

Characterizing Web QueriesTable below shows the distribution of query lengths forthese two last references

Length Dogpile (2005) Yahoo (2007)

1 18 25

2 32 35

3 25 23

4 13 11

5 6 4

6 3 1

gt7 3 1

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 39

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions

Early work used fixed limits of time to define sessionsbut this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence it is good to distinguish

time based sessions queries of a user in a same session

missions sequence of queries with the same goal

Research missions an additional problem is thatmissions can span more than one session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56

Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time

Different authors have found different thresholds ranging from fiveto sixty minutes

Query missions are a sequence of reformulatedqueries that express a same need

The detection of missions is a hard problem and hasbeen approached by several researchers

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57

Query PropertiesQuery Difficulty

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58

Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty

For example one word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval predictions mechanisms run the query andanalyze its answer set

Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59

Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand

Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents

The intuition is that the top ranked results of anunambiguous query will be topically cohesive

Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60

Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms

Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used

Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61

Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback

Weighted Information Gain measures the change ininformation about the quality of retrieval between

an imaginary state that only an average document is retrieved(estimated by the collection model) and

a posterior state where the actual search results are observed

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62

Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem

The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel

From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime

The overlap between L and Lprime is used as prediction score

Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63

Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty

For example they either take into account

the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or

the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)

Kwok et al proposed measures based on the inversedocument frequency and the collection frequency

Queries with low frequency terms are predicted toachieve a better performance

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64

Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score

Query Scope uses the number of documents in thecollection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on termfrequencies

SCS(q) =sum

kiisinq

Pml(ki|q) times log2

(

Pml(ki|q)

Pcoll(ki)

)

where

Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q

Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65

Pre-retrieval AlgorithmsAveraged PMI measures the average mutualinformation of two query terms in the collection

AvPMI(q) =1

|(ki kℓ)|

sum

(kikℓ)isinq

log2

(

Pcoll(ki kℓ)

Pcoll(ki)Pcoll(kℓ)

)

where Pcoll(ki kℓ) is the probability that terms ki and kℓ

occur in the same document

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 66

Pre-retrieval AlgorithmsPre-retrieval predictors usually have a lower accuracycompared to post-retrieval predictors

This is because the information available to them ismuch more general and much more sparse

Nevertheless two pre-retrieval predictors achieveperformance equivalent to post-retrieval predictors

Both approaches are computationally intensive

The technique proposed by He et al requires clustering

The technique proposed by Zhao et al requires determining thetfidf distribution of all documents

However both techniques are not efficient enough to beuseful for large collections

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 67

  • Queries Languages amp Properties
  • Query Languages
  • Query Languages
  • Query Languages
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Word Queries
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Pattern Matching
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Structural Queries
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Query Protocols
  • Query Protocols
  • Query Protocols
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms

Characterizing Web QueriesQueries as words in a text follow a biased distribution

In fact the frequency of query words follow a Zipfrsquos lawwith parameter α

The value of α ranges from 06 to 14 perhaps due to languageand cultural differences

However this is less biased than Web text where α iscloser to 2

The standard correlation among the frequency of aword in the Web pages and in the queries also varies

These values range from 015 to 042

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 40

Characterizing Web QueriesThis implies that what people search is different fromwhat people publish in the Web

Relative query frequency vs relative documentfrequency of each word in a vocabulary

1e-07

1e-06

1e-05

1e-04

0001

001

01

1

1e-06 1e-05 1e-04 0001 001 01 1

Que

ry fr

eque

ncy

Document frequency

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 41

Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search Behavior

State diagram of user behavior in a small search engine

[Figure: state diagram with the states Home Page, Advanced search, Directory, Results, Selected, Refining, and End session, plus nodes labeled URL and Web; edges are labeled with transition probabilities]


Query Intent

Before the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always aimed at helping the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals


Query Intent

Broder's taxonomy of Web search goals:

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web


Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus, they replaced Broder's transactional category with a broader category of resources


Query Intent

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as:

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be assigned to more than one class when the context of the search is not known


Query Topic

Queries can also be classified according to their topic, independently of the query intent

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology


Query Topic

Spink & Jansen et al. manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years


Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank  Topic                                       1997   2001
1     Commerce, travel, employment, or economy    13.3   24.7
2     People, places, or things                    6.7   19.7
3     Non-English or unknown                       4.1   11.3
4     Computers or Internet                       12.5    9.6
5     Sex or pornography                          16.8    8.5
6     Health or sciences                           9.5    7.5
7     Entertainment or recreation                 19.9    6.6
8     Education or humanities                      5.6    4.5
9     Society, culture, ethnicity, or religion     5.4    3.9
10    Government                                   3.4    2.0
11    Performing or fine arts                      5.4    1.1


Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: an F-score of about 45% on 63 categories

The table below shows the results for five queries

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies


Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They ran the results of a query through the classifier and then used a voting algorithm to classify the query
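
The voting step described above can be sketched as follows. This is only an illustration: a plain majority vote is assumed here (the actual voting scheme may weight results differently), the document classifier itself is not shown, and the name `classify_query_by_voting` and the category labels are hypothetical.

```python
from collections import Counter

def classify_query_by_voting(result_categories, top_k=1):
    """Return the top_k categories most often assigned to a query's results.

    result_categories holds one category label per retrieved document,
    as produced by some document classifier (not shown here).
    """
    votes = Counter(result_categories)
    return [category for category, _ in votes.most_common(top_k)]

# Hypothetical labels assigned to the six top results of a query.
labels = ["Pets", "Shopping", "Pets", "Health", "Pets", "Shopping"]
winner = classify_query_by_voting(labels)
```

Classifying the retrieved documents rather than the query itself gives the classifier far more text to work with, which is what makes the approach suitable for short, rare queries.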


Query Topic

Ambiguous queries are those that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous


Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the limit

the user might have more than one goal in the session

Hence, it is good to distinguish:

time-based sessions: the queries of a user in the same session

missions: a sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session


Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are sequences of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
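
The maximum-inactivity rule for session boundaries can be sketched as below. The function name `split_sessions` and the 30-minute default are assumptions chosen for illustration; any threshold in the five-to-sixty-minute range mentioned above could be substituted. The sketch handles the timestamps of a single user and does not attempt mission detection.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, max_inactivity=timedelta(minutes=30)):
    """Group one user's query timestamps into sessions.

    A new session starts whenever the gap to the previous query
    exceeds max_inactivity.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap: open a new session
    return sessions

t0 = datetime(2010, 1, 1, 9, 0)
times = [t0,
         t0 + timedelta(minutes=5),
         t0 + timedelta(minutes=50),
         t0 + timedelta(minutes=55)]
sessions = split_sessions(times)
```

With the toy timestamps, the 45-minute gap splits the four queries into two sessions of two queries each; raising the threshold to sixty minutes merges them into one.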


Query Properties: Query Difficulty


Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query


Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive

The term distribution of the results of an ambiguous query is assumed to be more similar to the collection distribution
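
The idea behind Clarity Score can be illustrated with a simplified sketch that measures the KL divergence between a unigram model of the top retrieved documents and a unigram model of the collection. This is an assumption-laden simplification: both models are unsmoothed maximum likelihood estimates, all top documents are weighted equally, and the top documents are assumed to be part of the collection (so no collection probability is zero). The name `clarity_score` and the toy collection are hypothetical.

```python
import math
from collections import Counter

def clarity_score(top_docs, collection_docs):
    """KL divergence (in bits) between the top-document unigram model
    and the collection unigram model, both estimated by raw counts."""
    top_counts = Counter(w for doc in top_docs for w in doc)
    coll_counts = Counter(w for doc in collection_docs for w in doc)
    top_total = sum(top_counts.values())
    coll_total = sum(coll_counts.values())
    score = 0.0
    for word, count in top_counts.items():
        p_top = count / top_total
        p_coll = coll_counts[word] / coll_total  # > 0 if top_docs is in the collection
        score += p_top * math.log2(p_top / p_coll)
    return score

collection = [
    ["jaguar", "car"],
    ["jaguar", "cat"],
    ["stock", "market"],
    ["car", "market"],
]
focused = clarity_score(collection[:1], collection)  # results on a single topic
diffuse = clarity_score(collection, collection)      # results mirror the collection
```

A higher score means the top results differ more from the collection as a whole, which is read as a sign of a clearer, easier query; a result set that looks like the collection scores zero.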


Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy
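
The overlap criterion across ranking functions can be sketched as the mean pairwise overlap of the top-k documents. This is a deliberately simple stand-in: set overlap at a fixed cutoff k is an assumption of this example, and the original work may compare ranked lists with a different measure. The function name `mean_topk_overlap` and the document identifiers are hypothetical.

```python
from itertools import combinations

def mean_topk_overlap(ranked_lists, k=10):
    """Average pairwise overlap |A & B| / k of the top-k documents
    returned by different ranking functions for the same query."""
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two ranked lists")
    return sum(len(a & b) for a, b in pairs) / (k * len(pairs))

# Three rankers agree on the same documents: predicted easy.
easy = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d3", "d1", "d2"]]
# Three rankers return mostly different documents: predicted difficult.
hard = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d1", "d7", "d8"]]
easy_overlap = mean_topk_overlap(easy, k=3)
hard_overlap = mean_topk_overlap(hard, k=3)
```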


Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed


Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L, a new query Q' is generated, and a second ranking L' is retrieved with Q'

The overlap between L and L' is used as the prediction score

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity
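
The Query Feedback loop (Q to L, L to Q', Q' to L', then the overlap of L and L') can be sketched as below. Everything concrete here is an assumption for illustration: the toy index `DOCS`, the term-matching ranker `toy_search`, and the rule in `toy_generate_query` that builds Q' from the most frequent terms of L are hypothetical stand-ins for a real retrieval system and a real query-generation step.

```python
from collections import Counter

DOCS = {  # hypothetical toy collection: document id -> terms
    "d1": ["zipf", "law", "query"],
    "d2": ["zipf", "law", "text"],
    "d3": ["weather", "forecast"],
}

def toy_search(query_terms):
    """Rank documents by the number of query terms they contain."""
    scored = [(sum(t in words for t in query_terms), doc_id)
              for doc_id, words in DOCS.items()]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

def toy_generate_query(ranked, n_terms=2):
    """Build Q' from the most frequent terms of the documents in L."""
    counts = Counter(w for doc_id in ranked for w in DOCS[doc_id])
    return [term for term, _ in counts.most_common(n_terms)]

def query_feedback_score(query, search, generate_query, k=10):
    """Fraction of the top-k of L that also appears in the top-k of L'."""
    ranked = search(query)[:k]            # L: output of the "channel"
    if not ranked:
        return 0.0
    new_query = generate_query(ranked)    # Q' derived from L
    reranked = search(new_query)[:k]      # L' retrieved with Q'
    return len(set(ranked) & set(reranked)) / len(ranked)

score = query_feedback_score(["zipf"], toy_search, toy_generate_query)
```

A high overlap suggests that the ranked list carries the query's information through the channel with little noise, which is taken as a sign of an easy query.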


Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they take into account either:

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency

Queries with low-frequency terms are predicted to achieve better performance
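
A frequency-based predictor such as Averaged IDF can be sketched in a few lines. The IDF variant used here, log2(N / df), is an assumption (several variants exist), terms absent from the collection are simply skipped, and the function name `averaged_idf` and the document frequencies are hypothetical.

```python
import math

def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean of log2(N / df) over the query terms found in the collection."""
    idfs = [math.log2(num_docs / doc_freq[t])
            for t in query_terms if doc_freq.get(t, 0) > 0]
    return sum(idfs) / len(idfs) if idfs else 0.0

# Hypothetical document frequencies in a collection of 1000 documents.
doc_freq = {"the": 1000, "zipf": 8, "law": 250}
rare = averaged_idf(["zipf"], doc_freq, num_docs=1000)
common = averaged_idf(["the", "law"], doc_freq, num_docs=1000)
```

The query made of a rare term gets the higher score, in line with the prediction that queries with low-frequency terms perform better.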


Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i \mid q)$ is the maximum likelihood estimate of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
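
The SCS formula can be computed directly, as in the sketch below. It assumes that the maximum likelihood estimate of a term given the query is the term's count in the query divided by the query length, and that every query term occurs at least once in the collection. The function name `simplified_clarity_score` and the collection counts are hypothetical.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, coll_term_counts, coll_total_terms):
    """Sum over distinct query terms of P_ml * log2(P_ml / P_coll)."""
    q_counts = Counter(query_terms)
    q_len = len(query_terms)
    score = 0.0
    for term, count in q_counts.items():
        p_ml = count / q_len                                 # P_ml(k_i | q)
        p_coll = coll_term_counts[term] / coll_total_terms   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score

# Hypothetical collection of 1024 term occurrences.
coll_counts = {"text": 64, "mining": 16}
scs = simplified_clarity_score(["text", "mining"], coll_counts, 1024)
```

For the toy values, each term has probability 1/2 in the query, and the collection probabilities are 1/16 and 1/64, so the score is 0.5 × 3 + 0.5 × 5 = 4 bits.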


Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
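
A small sketch of Averaged PMI follows. For simplicity it estimates all probabilities, including the single-term ones, as fractions of documents containing the terms; this is an assumption of the example and differs from the term-count definition of the collection probability used for SCS. It also assumes that every pair of query terms co-occurs in at least one document, since the logarithm is undefined otherwise. The name `averaged_pmi` and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def averaged_pmi(query_terms, docs):
    """Average PMI over all pairs of distinct query terms.

    docs is a list of sets of terms; probabilities are document fractions.
    """
    n = len(docs)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in docs) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    if not pairs:
        raise ValueError("need at least two distinct query terms")
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)

docs = [{"text", "mining"}, {"text", "mining"}, {"text"}, {"weather"}]
av = averaged_pmi(["text", "mining"], docs)
```

A positive value means the query terms co-occur more often than independence would predict, which suggests a topically coherent query.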


Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections





Characterizing Web QueriesSome logs also registers the number of answer pagesseen and the pages selected after a search

Many people refines the query adding and removingwords but most of them see very few answer pages

The table below shows query statistics for four differentsearch engines

Measure AltaVista Excite AlltheWeb TodoCL(1998) (2001) (2001) (2002)

Words per query 24 26 23 11

Queries per user 20 23 29 ndash

Answer pages per query 13 17 22 12

Boolean queries lt40 10 ndash 16

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 42

Characterizing Web QueriesIn addition the average number of pages clicked peranswer can be low due to navigational queries (around2 clicks per query)

Further studies have shown that the focus of thequeries has shifted from leisure to e-commerce (moredetails later in the section about Query Topics)

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 43

Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project.

  • Their results: an F-score of about 45% on 63 categories.

The table below shows the results for five queries.

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories.

  • This is an important problem because rare or infrequent queries are approximately half of all queries.
  • Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the search results of a query through the classifier and then used a voting algorithm to classify the query.
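The classify-then-vote idea can be sketched as follows. This is only a minimal illustration, not the method of Broder et al.: the function name, the `classify_doc` callback, and the use of an unweighted majority vote are all assumptions made for the example.

```python
from collections import Counter


def classify_query_by_voting(result_docs, classify_doc, top_n=1):
    """Label a query with the categories most often assigned to its results.

    result_docs  : documents retrieved for the query (any objects)
    classify_doc : hypothetical callback mapping one document to a category
    top_n        : how many of the most-voted categories to return
    """
    votes = Counter()
    for doc in result_docs:
        # Each retrieved document casts one vote for its predicted category.
        votes[classify_doc(doc)] += 1
    return [category for category, _ in votes.most_common(top_n)]
```

A real system would likely weight the votes, for instance by result rank or classifier confidence; the plain count above is the simplest choice.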

Query Topic

Ambiguous queries are those that can have two or more distinct meanings.

  • For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

  • Song et al. estimated that about 16% of all queries are ambiguous.

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

  • the session might be longer than the limit
  • the user might pursue more than one goal in the session

Hence it is good to distinguish:

  • time-based sessions: the queries of a user in the same session
  • missions: a sequence of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

  • Different authors have found different thresholds, ranging from five to sixty minutes.
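The inactivity rule can be sketched in a few lines. The function name and the 30-minute default are assumptions chosen for illustration (any value in the five-to-sixty-minute range mentioned above would do); timestamps are taken to be seconds for a single user.

```python
def split_sessions(timestamps, max_inactivity=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous query
    exceeds max_inactivity (assumed default: 30 minutes).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_inactivity:
            sessions[-1].append(t)   # still within the inactivity window
        else:
            sessions.append([t])     # gap too long: open a new session
    return sessions
```

For example, `split_sessions([0, 60, 4000, 4100])` puts the first two queries in one session and the last two in another, because the gap between 60 and 4000 seconds exceeds the threshold.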

Query missions are a sequence of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

  • For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set.
  • Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of the top results of an ambiguous query is assumed to be more similar to the collection distribution.
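One common way to quantify the "difference between language models" is the Kullback-Leibler divergence. The sketch below is a simplified illustration under stated assumptions: it uses unsmoothed maximum likelihood unigram models, represents documents as token lists, and assumes the top documents are drawn from the collection so every term has a non-zero collection probability. The function names are made up for the example.

```python
import math
from collections import Counter


def language_model(docs):
    """Unsmoothed unigram model: relative term frequency over all docs."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity_score(top_docs, collection):
    """KL divergence (in bits) between the top-document model and the
    collection model. Assumes top_docs is a subset of collection."""
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())
```

A score near zero means the top results look like the collection as a whole (an ambiguous or unfocused query), while a larger score indicates a topically cohesive result set.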

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

  • Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered difficult if different ranking functions retrieve diverse ranked lists.

  • If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed easy.
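The agreement between several ranking functions can be illustrated with a simple set-overlap measure. This is a rough sketch, not the measure used by Aslam et al.: the function name, the cutoff `k`, and the choice of the mean pairwise Jaccard coefficient are assumptions.

```python
from itertools import combinations


def mean_topk_overlap(ranked_lists, k=10):
    """Mean pairwise Jaccard overlap of the top-k documents.

    ranked_lists : at least two non-empty rankings (lists of document ids),
                   one per ranking function, all for the same query
    """
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

Values close to 1 mean the ranking functions largely agree (an easy query under this intuition); values close to 0 mean they retrieve very different documents.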

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and
  • a posterior state in which the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames query performance prediction as a communication channel problem.

  • The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.
  • From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.
  • The overlap between L and L′ is used as the prediction score.
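The Q, L, Q′, L′ loop can be sketched as below. The way Q′ is built here (the most frequent terms of the documents in L) is only an assumption for illustration, as are the function name, the `search` callback, and the `doc_terms` mapping; the source does not specify how Zhou and Croft generate Q′.

```python
from collections import Counter


def query_feedback_score(query, search, doc_terms, k=10, n_terms=5):
    """Overlap between the ranking of Q and the ranking of a query Q'
    derived from that ranking.

    query     : list of query terms (Q)
    search    : hypothetical callback, list of terms -> ranked doc ids
    doc_terms : dict mapping each doc id to its list of terms
    """
    ranked = search(query)[:k]                       # L
    if not ranked:
        return 0.0
    counts = Counter(t for d in ranked for t in doc_terms[d])
    new_query = [t for t, _ in counts.most_common(n_terms)]   # Q'
    reranked = search(new_query)[:k]                 # L'
    # Fraction of the original top documents that survive in L'.
    return len(set(ranked) & set(reranked)) / len(ranked)
```

A high score suggests the original ranking was a faithful, low-noise output of the channel, which is taken as a sign of an easier query.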

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

  • the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or
  • the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
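As an illustration of a frequency-based predictor, Averaged IDF can be sketched as the mean inverse document frequency of the query terms. The exact IDF variant is an assumption here (log base 2 of N over the document frequency, with no smoothing), as are the function and parameter names; every query term is assumed to occur in at least one document.

```python
import math


def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean IDF of the query terms.

    doc_freq : dict mapping a term to the number of documents containing it
    num_docs : total number of documents in the collection
    """
    idfs = [math.log2(num_docs / doc_freq[t]) for t in query_terms]
    return sum(idfs) / len(idfs)
```

Queries made of rare terms get a high value, which matches the prediction that low-frequency terms lead to better performance.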

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

  • $P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$
  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
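The SCS formula translates almost directly into code. In this sketch the maximum likelihood estimate of a term given the query is taken to be its relative frequency in the query, which is the usual reading but still an assumption; the function and parameter names are illustrative, and every query term is assumed to occur in the collection.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, coll_term_count, coll_total_terms):
    """SCS(q) = sum over query terms of P_ml * log2(P_ml / P_coll).

    coll_term_count  : dict mapping a term to its count in the collection
    coll_total_terms : total number of term occurrences in the collection
    """
    query_tf = Counter(query_terms)
    query_len = len(query_terms)
    score = 0.0
    for term, tf in query_tf.items():
        p_ml = tf / query_len                               # P_ml(k_i | q)
        p_coll = coll_term_count[term] / coll_total_terms   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score
```

Only query-side and collection-side counts are needed, so the score can be computed before any retrieval takes place.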

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document.
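A small sketch of Averaged PMI follows. To keep the joint and marginal probabilities comparable, the marginals are estimated here at the document level (the fraction of documents containing the term); that choice is an assumption of this example, as are the names used. The query is assumed to have at least two distinct terms, and every pair is assumed to co-occur in at least one document, since otherwise the logarithm is undefined.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over all query term pairs.

    docs : the collection, each document given as a list of terms
    """
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)
```

Positive values indicate that the query terms tend to appear together more often than chance would suggest.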

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

  • This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

  • The technique proposed by He et al. requires clustering.
  • The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

However, neither technique is efficient enough to be useful for large collections.


Characterizing Web Queries

In addition, the average number of pages clicked per answer can be low due to navigational queries (around 2 clicks per query).

Further studies have shown that the focus of the queries has shifted from leisure to e-commerce (more details later, in the section about Query Topics).

Query Properties: User Search Behavior

User Search Behavior

[Figure: state diagram of user behavior in a small search engine. Node labels: Home Page, Advanced search, Directory, URL, Web, Selected, Results, Refining, End session.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest.

The design of the search tool was always targeted at helping the user write good queries.

  • This implies that the query language adopted was usually complex.

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals.

Query Intent

Broder's taxonomy of Web search goals:

  • Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log).
  • Informational: the intent is to acquire some information (39% survey, 48% query log).
  • Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log).

This taxonomy has been heavily influential in discussions of query types on the Web.

Query Intent

Rose & Levinson developed a taxonomy that extended and somewhat changed Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replaced Broder's transactional category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

  • the anchor-text distribution of the words in the queries
  • past click behavior

The main problem is that many queries are inherently ambiguous.

  • They can be classified in more than one class when the context of the search is not known.



Query PropertiesUser Search Behavior

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 44

User Search BehaviorState diagram of user behavior in a small search engine

024

075

013

006

086

009

006

062

00601

017

075

015

091

006

009

041

017

008

099

022

027

URL

035

gt02

005gt=xgt001

02gt=xgt005

009

00

Web

003

001

058

006

009

011

Home Page

Advanced search

Directory

Selected

Results

Refining

End session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 45

Query IntentUntil the Web the concept of a user query wasassociated with searching for information of interest

The design of the search tool was always targeted tohelp the user write good queries

This implies that the query language adopted was usuallycomplex

The Web changed this drastically as users started touse search engines not only to find information but alsoto achieve other goals

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 46

Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions

Early work used fixed limits of time to define sessionsbut this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence it is good to distinguish

time based sessions queries of a user in a same session

missions sequence of queries with the same goal

Research missions an additional problem is thatmissions can span more than one session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56

Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time

Different authors have found different thresholds ranging from fiveto sixty minutes

Query missions are a sequence of reformulatedqueries that express a same need

The detection of missions is a hard problem and hasbeen approached by several researchers

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57

Query PropertiesQuery Difficulty

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58

Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty

For example one word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval predictions mechanisms run the query andanalyze its answer set

Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59

Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand

Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents

The intuition is that the top ranked results of anunambiguous query will be topically cohesive

Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60

Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms

Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used

Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61

Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback

Weighted Information Gain measures the change ininformation about the quality of retrieval between

an imaginary state that only an average document is retrieved(estimated by the collection model) and

a posterior state where the actual search results are observed

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62

Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem

The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel

From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime

The overlap between L and Lprime is used as prediction score

Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63

Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty

For example they either take into account

the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or

the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)

Kwok et al proposed measures based on the inversedocument frequency and the collection frequency

Queries with low frequency terms are predicted toachieve a better performance

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64

Pre-retrieval AlgorithmsHe et al evaluated a number of algorithms includingQuery Scope and Simplified Clarity Score

Query Scope uses the number of documents in thecollection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on termfrequencies

SCS(q) =sum

kiisinq

Pml(ki|q) times log2

(

Pml(ki|q)

Pcoll(ki)

)

where

Pml(ki|q) is the maximum likelihood estimator of term ki givenquery q

Pcoll(ki) is the term count of ki in the collection divided by thetotal number of terms in the collection

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 65

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
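A small sketch of the AvPMI computation follows. It assumes that all probabilities are estimated at the document level (fraction of documents containing the term or the pair), that every pair of query terms co-occurs in at least one document (otherwise the logarithm is undefined and some smoothing would be needed), and that a query with fewer than two distinct terms scores 0. These choices and the name `averaged_pmi` are assumptions of the example.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over all pairs of query terms.

    docs is a list of token lists. Probabilities are document-level
    relative frequencies (an assumption of this sketch).
    """
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / n

    pairs = list(combinations(sorted(set(query_terms)), 2))
    if not pairs:
        return 0.0
    total = sum(math.log2(prob(a, b) / (prob(a) * prob(b))) for a, b in pairs)
    return total / len(pairs)
```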

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors

Both approaches are computationally intensive

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

However, neither technique is efficient enough to be useful for large collections


User Search Behavior

[Figure: State diagram of user behavior in a small search engine. Node labels: URL, Web, Home Page, Advanced search, Directory, Selected, Results, Refining, End session. Edges carry numeric weights, with a legend for the ranges > 0.2, 0.2 >= x > 0.05, and 0.05 >= x > 0.01.]

Query Intent

Until the Web, the concept of a user query was associated with searching for information of interest

The design of the search tool was always targeted at helping the user write good queries

This implies that the query language adopted was usually complex

The Web changed this drastically, as users started to use search engines not only to find information but also to achieve other goals

Query Intent

Broder's taxonomy of Web search goals

Navigational: the immediate intent is to reach a particular site (24.5% survey, 20% query log)

Informational: the intent is to acquire some information (39% survey, 48% query log)

Transactional: the intent is to perform some web-mediated activity (36% survey, 30% query log)

This taxonomy has been heavily influential in discussions of query types on the Web

Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat changes Broder's

They noted that much of what happens on the Web is the acquisition and consumption of online resources

Thus, they replaced Broder's transactional category with a broader category of resources

Query Intent

Recent research has dealt with the automatic prediction of these classes

These works use machine learning over different query attributes, such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherently ambiguous

They can be classified into more than one class when the context of the search is not known

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent

For example, a search involving the topic of weather can consist of

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank  Topic                                       1997   2001
1     Commerce, travel, employment, or economy    13.3   24.7
2     People, places, or things                    6.7   19.7
3     Non-English or unknown                       4.1   11.3
4     Computers or Internet                       12.5    9.6
5     Sex or pornography                          16.8    8.5
6     Health or sciences                           9.5    7.5
7     Entertainment or recreation                 19.9    6.6
8     Education or humanities                      5.6    4.5
9     Society, culture, ethnicity, or religion     5.4    3.9
10    Government                                   3.4    2.0
11    Performing or fine arts                      5.4    1.1

Query Topic

Shen et al. mapped the results of a search engine to the set of categories in the Open Directory Project

Their results: an F-score of about 45% on 63 categories

The table below shows the results for five queries

Query                  Top category             Second category
chat rooms             Computers/Internet       Online Community/Chat
lake michigan lodges   Info/Local & Regional    Living/Travel & Vacation
stephen hawking        Info/Science & Tech      Info/Arts & Humanities
dog shampoo            Shopping/Buying Guides   Living/Pets & Animals
text mining            Computers/Software       Information/Companies

Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6,000 categories

This is an important problem because rare or infrequent queries are approximately half of all queries

Training data: a commercial taxonomy containing many documents assigned to each category

They classified the results of a query with the classifier and then used a voting algorithm to classify the query
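The idea of labeling a query through its results can be illustrated with a very small sketch. The slide only says that a voting algorithm was used; the plain majority vote below is an assumption, and the function name `classify_by_vote` and the category labels are hypothetical.

```python
from collections import Counter


def classify_by_vote(result_categories):
    """Return the category assigned to most of the query's results.

    result_categories holds one predicted category per search result.
    Returns None when the query has no results. Ties are broken in favor
    of the category seen first (a property of Counter.most_common).
    """
    if not result_categories:
        return None
    return Counter(result_categories).most_common(1)[0][0]
```

For example, if two of three results for a query are classified as "Shopping" and one as "Pets", the query itself is labeled "Shopping".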

Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings

For instance, a query can be related to politics but also to news

There are few papers on ambiguity detection

Song et al. estimated that about 16% of all queries are ambiguous

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions

Early work used fixed time limits to define sessions, but this definition presents two problems

the session might be longer than the limit

the user might have more than one goal in the session

Hence, it is useful to distinguish

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time

Different authors have found different thresholds, ranging from five to sixty minutes

Query missions are sequences of reformulated queries that express the same need

The detection of missions is a hard problem and has been approached by several researchers
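Splitting a user's query stream by a maximum inactivity time can be sketched as follows. The 30-minute default is only one value inside the five-to-sixty-minute range mentioned above, and the function name and the use of timestamps in seconds are assumptions of this example.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group query timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous query
    exceeds max_gap_seconds.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(t)   # still within the inactivity limit
        else:
            sessions.append([t])     # gap too long: open a new session
    return sessions
```

Note that this only finds time-based sessions; detecting missions would additionally require comparing the queries themselves.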

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval prediction mechanisms run the query and analyze its answer set

Pre-retrieval mechanisms evaluate the difficulty without executing the query

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution
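One common way to quantify the difference between two language models is the Kullback-Leibler divergence, and the sketch below uses it to illustrate the Clarity Score idea. It is a simplification under stated assumptions: both models are unsmoothed unigram relative frequencies, the top documents are weighted equally, and the top documents are part of the collection (so every term has a non-zero collection probability). The function names are hypothetical.

```python
import math
from collections import Counter


def unigram_model(docs):
    """Relative term frequencies over a list of token lists."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def clarity_score(top_docs, collection):
    """KL divergence (base 2) of the top-document model from the collection model.

    Assumes top_docs is a subset of collection, so p_coll[term] > 0.
    """
    p_top = unigram_model(top_docs)
    p_coll = unigram_model(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())
```

A result set that looks like the collection as a whole scores 0, while a topically cohesive result set scores higher.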

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy
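The agreement between several ranked lists can be measured in many ways. As a rough sketch, the function below averages the Jaccard overlap of the top-k documents over all pairs of rankings; the choice of Jaccard overlap, the cut-off `k`, and the name `mean_pairwise_overlap` are assumptions of this example rather than the measure used by the cited authors.

```python
from itertools import combinations


def mean_pairwise_overlap(rankings, k):
    """Average Jaccard overlap of the top-k sets over all pairs of rankings.

    rankings is a list of ranked lists of document ids, for example one per
    ranking function. Values near 1 suggest an easy query, values near 0 a
    difficult one.
    """
    if len(rankings) < 2:
        raise ValueError("at least two rankings are needed")
    tops = [set(r[:k]) for r in rankings]
    pairs = list(combinations(tops, 2))
    # Two empty top-k sets count as no overlap.
    return sum(len(a & b) / len(a | b) if a | b else 0.0
               for a, b in pairs) / len(pairs)
```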






Query IntentBroderrsquos taxonomy of Web search goals

Navigational The immediate intent is to reach a particular site(245 survey 20 query log)

Informational The intent is to acquire some information (39survey 48 query log)

Transactional The intent is to perform some web-mediatedactivity (36 survey 30 query log)

This taxonomy has been heavily influential indiscussions of query types on the Web

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 47

Query IntentRose amp Levinson developed a taxonomy that extendedand changed somewhat from Broderrsquos

They noted that much of what happens on the Web isthe acquisition and consumption of online resources

Thus they replace Broderrsquos transactions category with abroader category of resources

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 48

Query IntentRecent research have dealt with the automaticprediction of these classes

These works use machine learning over different queryattributes such as

Anchor-text distribution of the words in the queries

Past click behavior

The main problem is that many queries are inherentlyambiguous

They can be classified in more than one class when the contextof the search is not known

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 49

Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions

Early work used fixed limits of time to define sessionsbut this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence it is good to distinguish

time based sessions queries of a user in a same session

missions sequence of queries with the same goal

Research missions an additional problem is thatmissions can span more than one session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
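The inactivity-based definition of a session can be sketched in a few lines. The function name `split_sessions`, the `(timestamp, query)` event layout and the 30-minute default are assumptions made for illustration; the only constraint taken from the text is that reported thresholds range from five to sixty minutes.

```python
from datetime import datetime, timedelta


def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Group one user's (timestamp, query) events into sessions.

    A new session starts whenever the time elapsed since the previous
    query exceeds max_gap (the maximum inactivity time).
    """
    sessions = []
    for timestamp, query in sorted(events):
        previous = sessions[-1][-1][0] if sessions else None
        if previous is not None and timestamp - previous <= max_gap:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions


log = [
    (datetime(2010, 1, 1, 9, 0), "jaguar"),
    (datetime(2010, 1, 1, 9, 10), "jaguar car"),
    (datetime(2010, 1, 1, 10, 30), "weather"),
    (datetime(2010, 1, 1, 10, 35), "weather boston"),
]
for session in split_sessions(log):
    print([query for _, query in session])
```

This only recovers time-based sessions. Detecting missions would additionally require comparing the queries themselves, since one mission can span several such sessions.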

Query Properties: Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of the results of an ambiguous query is assumed to be more similar to the collection distribution.
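As a rough illustration of the idea, the difference between the two language models is commonly measured with the Kullback-Leibler divergence. The sketch below makes several simplifying assumptions that are not in the text: unsmoothed unigram models, equal weight for every top document, and top documents drawn from the collection itself (so every term has a non-zero collection probability). The names `unigram_model` and `clarity_score` are illustrative.

```python
import math
from collections import Counter


def unigram_model(documents):
    """Maximum likelihood unigram model over a list of tokenized documents."""
    counts = Counter(term for doc in documents for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity_score(top_documents, collection):
    """KL divergence (in bits) between the top-document and collection models.

    Assumes every term of top_documents also occurs in collection.
    """
    top_model = unigram_model(top_documents)
    coll_model = unigram_model(collection)
    return sum(p * math.log2(p / coll_model[term])
               for term, p in top_model.items())


collection = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["stock", "market", "news"],
]
# A topically cohesive result set diverges from the collection model...
print(clarity_score(collection[:2], collection))
# ...while a result set that mirrors the collection has zero clarity.
print(clarity_score(collection, collection))
```

A higher score suggests a more focused result set and therefore an easier query.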

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
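The agreement between ranked lists can be approximated with a simple set overlap, as in the sketch below. The Jaccard overlap of the top-k documents, the cutoff `k` and the function name `mean_topk_overlap` are assumptions of this sketch; they are not necessarily the measure used by the cited authors.

```python
from itertools import combinations


def mean_topk_overlap(ranked_lists, k=10):
    """Average pairwise Jaccard overlap of the top-k documents.

    ranked_lists: one ranked list of document ids per ranking function
    (or per sub-query). Values near 1 suggest an easy query, values near 0
    a difficult one.
    """
    pairs = list(combinations(ranked_lists, 2))
    if not pairs:
        raise ValueError("at least two ranked lists are required")
    total = 0.0
    for first, second in pairs:
        top_first, top_second = set(first[:k]), set(second[:k])
        union = top_first | top_second
        # Two empty lists are treated as being in full agreement.
        total += len(top_first & top_second) / len(union) if union else 1.0
    return total / len(pairs)


easy = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d3", "d1", "d2"]]
hard = [["d1", "d2"], ["d3", "d4"], ["d5", "d6"]]
print(mean_topk_overlap(easy), mean_topk_overlap(hard))
```

The same function can compare the ranking of the full query against the rankings of its sub-queries.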

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state in which the actual search results are observed.

Post-retrieval Algorithms

Query Feedback frames the prediction of query difficulty as a communication channel problem.

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L, a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.
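The Query Feedback loop can be sketched independently of any particular retrieval system. In the sketch below, `search` and `generate_query` are hypothetical callables supplied by the caller (the retrieval system and the procedure that derives Q′ from L), and the top-k set overlap is one simple way to compare L and L′; none of these details come from the text.

```python
def query_feedback_score(query, search, generate_query, k=10):
    """Overlap between the ranking L of a query and the ranking L' of Q'.

    search(query) returns a ranked list of document ids.
    generate_query(ranking) builds the new query Q' from the ranking L.
    Returns the fraction of the top-k of L that also appears in the
    top-k of L'.
    """
    first_ranking = search(query)[:k]
    if not first_ranking:
        return 0.0
    second_ranking = search(generate_query(first_ranking))[:k]
    return len(set(first_ranking) & set(second_ranking)) / len(first_ranking)


# Toy stand-ins for a retrieval system and a query generator.
index = {
    "jaguar": ["d1", "d2", "d3", "d4"],
    "jaguar car": ["d1", "d2", "d5", "d6"],
}
score = query_feedback_score(
    "jaguar",
    search=lambda q: index.get(q, []),
    generate_query=lambda ranking: "jaguar car",
)
print(score)
```

A high overlap means the channel distorted the query only slightly, which is read as a sign of an easy query.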

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
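A frequency-based predictor such as Averaged IDF needs only document frequencies. The sketch below assumes the common form idf = log2(N / df) and ignores query terms that do not occur in the collection; both choices, as well as the function name `averaged_idf` and the sample statistics, are assumptions rather than details given in the text.

```python
import math


def averaged_idf(query_terms, document_frequency, num_documents):
    """Mean inverse document frequency of the query terms.

    document_frequency maps a term to the number of documents containing it.
    Terms absent from the collection are skipped, since no statistics
    exist for them.
    """
    idfs = [
        math.log2(num_documents / document_frequency[term])
        for term in query_terms
        if document_frequency.get(term, 0) > 0
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0


# Hypothetical document frequencies in a collection of 1,000 documents.
df = {"the": 1000, "xylophone": 10}
print(averaged_idf(["xylophone"], df, 1000))
print(averaged_idf(["the"], df, 1000))
```

The rare term yields the higher score, matching the prediction that queries with low-frequency terms perform better.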

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2 \left( \frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
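Both predictors translate directly into code. In the sketch below, $P_{ml}(k_i \mid q)$ is taken to be the count of the term in the query divided by the query length, and every query term is assumed to occur in the collection; the function names and the sample counts are illustrative only.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_term_counts, collection_length):
    """SCS(q) = sum over query terms of P_ml(k|q) * log2(P_ml(k|q) / P_coll(k)).

    collection_term_counts maps each term to its number of occurrences in
    the collection; collection_length is the total number of term
    occurrences. Every query term is assumed to occur in the collection.
    """
    query_counts = Counter(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        p_ml = count / len(query_terms)
        p_coll = collection_term_counts[term] / collection_length
        score += p_ml * math.log2(p_ml / p_coll)
    return score


def query_scope_count(query_terms, documents):
    """Number of documents that contain at least one of the query terms."""
    terms = set(query_terms)
    return sum(1 for doc in documents if terms & set(doc))


# Hypothetical counts: 1,000 term occurrences in the whole collection.
counts = {"text": 10, "mining": 5}
print(simplified_clarity_score(["text", "mining"], counts, 1000))

docs = [["text", "mining"], ["text", "editor"], ["data", "base"]]
print(query_scope_count(["text", "mining"], docs))
```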

Pre-retrieval Algorithms

Averaged PMI measures the average pointwise mutual information of pairs of query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document, and $|(k_i, k_\ell)|$ is the number of term pairs in the query.
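A small sketch of Averaged PMI follows. It assumes that all probabilities are document-level (the fraction of documents containing a term or a pair of terms), and it skips pairs that never co-occur because their PMI is undefined, so the average is taken over the remaining pairs only. Both choices, and the name `averaged_pmi`, are assumptions of this sketch.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, documents):
    """Average pointwise mutual information over pairs of distinct query terms."""
    doc_sets = [set(doc) for doc in documents]

    def probability(*terms):
        # Fraction of documents that contain all of the given terms.
        return sum(1 for doc in doc_sets if all(t in doc for t in terms)) / len(doc_sets)

    pmis = []
    for first, second in combinations(sorted(set(query_terms)), 2):
        joint = probability(first, second)
        if joint == 0:
            continue  # PMI is undefined when the terms never co-occur.
        pmis.append(math.log2(joint / (probability(first) * probability(second))))
    return sum(pmis) / len(pmis) if pmis else 0.0


docs = [["a", "b"], ["a", "b"], ["a"], ["c"]]
print(averaged_pmi(["a", "b"], docs))
```

Positive values indicate that the query terms co-occur more often than independence would predict, which points to a coherent query.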

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

As a result, neither technique is efficient enough to be useful for large collections.


Query Intent

Rose & Levinson developed a taxonomy that extends and somewhat modifies Broder's.

They noted that much of what happens on the Web is the acquisition and consumption of online resources.

Thus, they replaced Broder's transactions category with a broader category of resources.

Query Intent

Recent research has dealt with the automatic prediction of these classes.

These works use machine learning over different query attributes, such as:

the anchor-text distribution of the words in the queries

past click behavior

The main problem is that many queries are inherently ambiguous.

They can be assigned to more than one class when the context of the search is not known.

Query Topic

Queries can also be classified according to the topic of the query, independently of the query intent.

For example, a search involving the topic of weather can consist of:

the simple information need of looking at today's forecast, or

the rich and complex information need of studying meteorology.

Query Topic

Spink & Jansen et al. have manually analyzed samples of query logs.

The authors found that queries relating to sex and pornography declined from 16.8% in 1997 to just 3.6% in 2005.

On the other hand, commerce-related queries increased from 13% to almost 25% in the same years.

Query Topic

Changes in Excite topics from 1997 to 2001 (%)

Rank  Topic                                        1997   2001
1     Commerce, travel, employment, or economy     13.3   24.7
2     People, places, or things                     6.7   19.7
3     Non-English or unknown                        4.1   11.3
4     Computers or Internet                        12.5    9.6
5     Sex or pornography                           16.8    8.5
6     Health or sciences                            9.5    7.5
7     Entertainment or recreation                  19.9    6.6
8     Education or humanities                       5.6    4.5
9     Society, culture, ethnicity, or religion      5.4    3.9
10    Government                                    3.4    2.0
11    Performing or fine arts                       5.4    1.1





Query TopicQueries can also be classified according to the topic ofthe query independently of the query intent

For example a search involving the topic of weathercan consist of

the simple information need of looking at todayrsquos forecast or

the rich and complex information need of studying meteorology

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 50

Query TopicSpink amp Jansen et al have manually analyzed samplesof query logs

The authors found that queries relating to sex and pornographydeclined from 168 in 1997 to just 36 in 2005

On the other hand commerce-related queries increased from13 to almost 25 in the same years

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 51

Query TopicChanges in Excite topics from 1997 to 2001 ()

Rank Topic 1997 2001

1 Commerce travel employment or economy 133 247

2 People places or things 67 197

3 Non-English or unknown 41 113

4 Computers or Internet 125 96

5 Sex or pornography 168 85

6 Health or sciences 95 75

7 Entertainment or recreation 199 66

8 Education or humanities 56 45

9 Society culture ethnicity or religion 54 39

10 Government 34 20

11 Performing or fine arts 54 11

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 52

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the fixed limit

the user might have more than one goal in the session

Hence it is useful to distinguish:

time-based sessions: queries of a user in the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.

Query Session Boundaries

One alternative for determining sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.
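
The inactivity rule can be sketched as follows. This is a minimal illustration that assumes the query log of one user is available as timestamps in seconds; the function name `split_sessions` and the 30-minute default are assumptions chosen from within the five-to-sixty-minute range mentioned above.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group one user's query timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous query
    exceeds max_gap_seconds (the maximum inactivity time).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(t)   # still within the inactivity limit
        else:
            sessions.append([t])     # gap too long: open a new session
    return sessions


# Three queries at 0 s, 600 s and 700 s.
with_5_min = split_sessions([0, 600, 700], max_gap_seconds=5 * 60)
with_30_min = split_sessions([0, 600, 700])
```

With a five-minute threshold the first query forms its own session, while the 30-minute threshold keeps all three queries together, which shows how sensitive the segmentation is to the chosen limit.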

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.

Query Properties

Query Difficulty

Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.

Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive.

The term distribution of an ambiguous query is assumed to be more similar to the collection distribution.
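
A minimal sketch of this idea follows, assuming the difference between the two language models is measured as a Kullback-Leibler divergence in bits. The unweighted unigram model over the top documents is a simplification made for illustration, and the names `unigram_model` and `clarity_score` are hypothetical. The sketch also assumes the top documents belong to the collection, so every term has a nonzero collection probability.

```python
import math
from collections import Counter


def unigram_model(docs):
    """Maximum likelihood unigram model over a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def clarity_score(top_docs, collection_docs):
    """KL divergence (bits) between the top-document and collection models."""
    p_top = unigram_model(top_docs)
    p_coll = unigram_model(collection_docs)
    return sum(p * math.log2(p / p_coll[tok]) for tok, p in p_top.items())


collection = [["a", "a"], ["b", "b"]]
focused = clarity_score([["a", "a"]], collection)   # cohesive result set
diffuse = clarity_score(collection, collection)     # looks like the collection
```

A result set whose term distribution matches the collection scores zero, which is the behavior expected for an ambiguous query; a topically cohesive result set scores higher.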

Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists of the query's terms.

Intuition: for well-performing queries, the results do not change considerably if only a subset of query terms is used.

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists.

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy.
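
The overlap criterion can be illustrated with a small sketch. The slides do not say how overlap is measured, so the mean pairwise Jaccard overlap of the top-k documents used here is an assumption rather than the measure used by Aslam et al., and `mean_topk_overlap` is a hypothetical name.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_topk_overlap(ranked_lists, k=10):
    """Average pairwise overlap of the top-k documents of several rankings.

    ranked_lists: one ranked list of document ids per ranking function.
    A high value suggests an easy query, a low value a difficult one.
    """
    tops = [set(ranking[:k]) for ranking in ranked_lists]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two ranked lists")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


agreement = mean_topk_overlap([[1, 2, 3], [2, 3, 4]], k=3)
```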

Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback.

Weighted Information Gain measures the change in information about the quality of retrieval between:

an imaginary state in which only an average document is retrieved (estimated by the collection model), and

a posterior state where the actual search results are observed

Post-retrieval Algorithms

Query Feedback frames query prediction as a communication channel problem.

The input is query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel.

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′.

The overlap between L and L′ is used as the prediction score.
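
The loop from Q to L to Q′ to L′ can be sketched as below. How Q′ is generated from L is not described in the slides, so picking the most frequent terms of the top documents is an assumption, as are the toy retriever, the toy collection, and all names (`query_feedback_score`, `toy_retrieve`, `DOCS`).

```python
from collections import Counter


def query_feedback_score(query, retrieve, docs, k=10, n_terms=5):
    """Fraction of the top-k of L that reappears in the top-k of L'.

    retrieve: callable mapping a list of query terms to a ranked list of ids.
    docs: mapping from document id to its list of tokens.
    """
    ranking = retrieve(query)[:k]                                  # L
    if not ranking:
        return 0.0
    term_counts = Counter(t for d in ranking for t in docs[d])
    new_query = [t for t, _ in term_counts.most_common(n_terms)]   # Q'
    new_ranking = retrieve(new_query)[:k]                          # L'
    return len(set(ranking) & set(new_ranking)) / len(ranking)


# Toy collection and retriever, for illustration only.
DOCS = {
    "d1": ["cat", "cat", "pet"],
    "d2": ["cat", "food"],
    "d3": ["stock", "market"],
}


def toy_retrieve(query_terms):
    """Rank documents by how often they contain the query terms."""
    scores = {d: sum(toks.count(t) for t in query_terms) for d, toks in DOCS.items()}
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))


score = query_feedback_score(["cat"], toy_retrieve, DOCS)
```

In this toy run the regenerated query retrieves the same two documents, so the overlap is complete and the query would be predicted to be easy.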

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity.

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either:

the frequencies of the query terms in the collection, such as Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, such as Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
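
As an illustration of a frequency-based predictor, the sketch below computes Averaged IDF. The slides name the predictor without giving a formula, so the IDF variant log2(N / df) and the choice to skip terms absent from the collection are assumptions; `averaged_idf` is a hypothetical name.

```python
import math


def averaged_idf(query_terms, doc_freq, num_docs):
    """Mean inverse document frequency of the query terms.

    doc_freq: mapping from term to the number of documents containing it.
    num_docs: total number of documents in the collection.
    Terms that do not occur in the collection are skipped (an assumption).
    """
    idfs = [
        math.log2(num_docs / doc_freq[t])
        for t in query_terms
        if doc_freq.get(t, 0) > 0
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0


doc_freq = {"the": 1000, "zebra": 10}
common = averaged_idf(["the"], doc_freq, num_docs=1000)
rare = averaged_idf(["zebra"], doc_freq, num_docs=1000)
```

The query made of a rare term gets the higher score, in line with the prediction that low-frequency terms lead to better performance.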

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2\left(\frac{P_{ml}(k_i|q)}{P_{coll}(k_i)}\right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
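
The SCS formula can be computed directly from term counts. In this sketch the maximum likelihood estimate is taken as the frequency of the term in the query divided by the query length, and query terms that never occur in the collection are skipped; both choices, along with the name `simplified_clarity_score`, are assumptions made for illustration.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, coll_term_counts, coll_total_terms):
    """SCS(q): sum over query terms of P_ml * log2(P_ml / P_coll).

    coll_term_counts: mapping from term to its count in the collection.
    coll_total_terms: total number of term occurrences in the collection.
    """
    q_counts = Counter(query_terms)
    q_len = len(query_terms)
    score = 0.0
    for term, qtf in q_counts.items():
        coll_count = coll_term_counts.get(term, 0)
        if coll_count == 0:
            continue  # unseen term: skipped here (an assumption)
        p_ml = qtf / q_len                       # P_ml(k_i | q)
        p_coll = coll_count / coll_total_terms   # P_coll(k_i)
        score += p_ml * math.log2(p_ml / p_coll)
    return score


scs = simplified_clarity_score(["a", "b"], {"a": 1, "b": 1}, coll_total_terms=4)
```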

Pre-retrieval Algorithms

Averaged PMI measures the average mutual information of two query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2\left(\frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\,P_{coll}(k_\ell)}\right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
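
A minimal sketch of Averaged PMI follows. It estimates all probabilities at the document level (the fraction of documents containing a term or a pair of terms) and skips pairs that never co-occur, since their logarithm is undefined; both choices are assumptions made for illustration, and `averaged_pmi` is a hypothetical name.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average pointwise mutual information over pairs of query terms.

    docs: list of tokenized documents forming the collection.
    Pairs that never co-occur are skipped (an assumption).
    """
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(all(t in s for t in terms) for s in doc_sets) / n

    pmis = []
    for ki, kl in combinations(sorted(set(query_terms)), 2):
        joint = prob(ki, kl)
        if joint > 0:
            pmis.append(math.log2(joint / (prob(ki) * prob(kl))))
    return sum(pmis) / len(pmis) if pmis else 0.0


docs = [["a", "b"], ["a", "b"], ["c"], ["d"]]
av_pmi = averaged_pmi(["a", "b"], docs)
```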

Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

However, neither technique is efficient enough to be useful for large collections.

Query TopicShen et al mapped the results of a search engine to theset of categories in the Open Directory Project

Their results F-score of about 45 on 63 categories

The table below shows the results for five queries

Query Top category Second category

chat rooms ComputersInternet Online CommunityChat

lake michigan lodges InfoLocal amp Regional LivingTravel amp Vacation

stephen hawking InfoScience amp Tech InfoArts amp Humanities

dog shampoo ShoppingBuying Guides LivingPets amp Animals

text mining ComputersSoftware InformationCompanies

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 53

Query TopicBroder et al presented a method for classifying shortrare queries into a taxonomy of 6000 categories

This is an important problem because rare or infrequent queriesare approximately half of all queries

Training data a commercial taxonomy containing manydocuments assigned to each category

They classified the results of a query in the classifierand then used a voting algorithm to classify the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 54

Query TopicAmbiguous queries are those queries that can have twoor more distinct meanings

For instance a query can be related to politics but also to news

There are few papers for ambiguity detection

Song et al estimated that about 16 of all queries areambiguous

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 55

Query Sessions and MissionsOne important problem when analyzing queries isdetermining user query sessions

Early work used fixed limits of time to define sessionsbut this definition presents two problems

the session might be longer

the goals of the user in the session might be more than one

Hence it is good to distinguish

time based sessions queries of a user in a same session

missions sequence of queries with the same goal

Research missions an additional problem is thatmissions can span more than one session

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 56

Query Session BoundariesOne alternative to determine sessions more accuratelyis to establish a maximum inactivity time

Different authors have found different thresholds ranging from fiveto sixty minutes

Query missions are a sequence of reformulatedqueries that express a same need

The detection of missions is a hard problem and hasbeen approached by several researchers

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 57

Query PropertiesQuery Difficulty

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 58

Query DifficultyAnother important characteristic of queries is theirintrinsic difficulty

For example one word queries are simpler than phrase queries

There are two different ways to measure difficulty

Post-retrieval predictions mechanisms run the query andanalyze its answer set

Pre-retrieval mechanisms evaluate the difficulty withoutexecuting the query

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 59

Post-retrieval AlgorithmsPost-retrieval algorithms are more diverse as they havemore information at hand

Clarity Score relies on the difference between thelanguage models of the collection and of the topretrieved documents

The intuition is that the top ranked results of anunambiguous query will be topically cohesive

Term distribution of an ambiguous query is assumed tobe more similar to the collection distribution

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 60

Post-retrieval AlgorithmsYom-Tov et al compared the ranked list of the originalquery with the ranked lists of the queryrsquos terms

Intuition for well performing queries the results do not changeconsiderably if only a subset of query terms is used

Aslam et al a query is considered to be difficult ifdifferent ranking functions retrieve diverse ranked lists

If the overlap between the top ranked documents is large acrossall ranked lists the query is deemed to be easy

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 61

Post-retrieval AlgorithmsZhou and Croft investigated two approaches WeightedInformation Gain and Query Feedback

Weighted Information Gain measures the change ininformation about the quality of retrieval between

an imaginary state that only an average document is retrieved(estimated by the collection model) and

a posterior state where the actual search results are observed

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 62

Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem

The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel

From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime

The overlap between L and Lprime is used as prediction score

Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63

Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty.

For example, they take into account either

the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI).

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency.

Queries with low-frequency terms are predicted to achieve better performance.
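As an illustration of a frequency-based predictor, the sketch below computes an averaged IDF over the query terms. The IDF variant, log2(N / df), is one common choice and an assumption here, as are the function name `averaged_idf` and the toy collection; every query term is assumed to occur in at least one document.

```python
import math

def averaged_idf(query, docs):
    """Mean inverse document frequency of the query terms (log base 2)."""
    n_docs = len(docs)
    idfs = []
    for term in query:
        df = sum(1 for d in docs if term in d)  # document frequency, assumed > 0
        idfs.append(math.log2(n_docs / df))
    return sum(idfs) / len(idfs)

docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "repair"],
    ["cat", "jungle", "river"],
]
rare = averaged_idf(["repair"], docs)           # term occurs in 1 of 4 documents
common = averaged_idf(["car", "engine"], docs)  # each term occurs in 2 of 4
```

Under this predictor, the query made of rarer terms gets the higher score and is therefore predicted to perform better.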


Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score.

Query Scope uses the number of documents in the collection that contain at least one of the query terms.

Simplified Clarity Score (SCS) relies on term frequencies:

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i \mid q) \times \log_2\left(\frac{P_{ml}(k_i \mid q)}{P_{coll}(k_i)}\right)$$

where

$P_{ml}(k_i \mid q)$ is the maximum likelihood estimator of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
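The SCS formula can be evaluated directly, as in the sketch below. It assumes that the maximum likelihood estimate of a term given the query is its count in the query divided by the query length, and that every query term occurs in the collection; the function name `scs` and the toy collection are hypothetical.

```python
import math
from collections import Counter

def scs(query, collection):
    """Simplified Clarity Score of a tokenized query over a tokenized collection."""
    coll_counts = Counter(t for d in collection for t in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for term, qtf in Counter(query).items():
        p_ml = qtf / len(query)                # P_ml(k_i | q)
        p_coll = coll_counts[term] / coll_len  # P_coll(k_i), assumed > 0
        score += p_ml * math.log2(p_ml / p_coll)
    return score

collection = [
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "repair"],
    ["cat", "jungle", "river"],
]
specific = scs(["repair"], collection)  # rare term: 1 of 12 tokens
general = scs(["car"], collection)      # more frequent term: 2 of 12 tokens
```

The query built from the rarer term obtains the higher score, since its distribution departs more from the collection distribution.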


Pre-retrieval Algorithms

Averaged PMI measures the average pointwise mutual information of pairs of query terms in the collection:

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2\left(\frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\,P_{coll}(k_\ell)}\right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
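A minimal sketch of the AvPMI computation follows. It assumes document-level probabilities throughout (the fraction of documents containing a term, or both terms of a pair), at least two distinct query terms, and that every pair co-occurs in at least one document; the name `av_pmi` and the toy data are illustrative.

```python
import math
from itertools import combinations

def av_pmi(query, docs):
    """Average pointwise mutual information over all pairs of distinct query terms."""
    n_docs = len(docs)

    def p(*terms):
        # Fraction of documents containing all the given terms.
        return sum(1 for d in docs if all(t in d for t in terms)) / n_docs

    pairs = list(combinations(sorted(set(query)), 2))
    total = sum(math.log2(p(a, b) / (p(a) * p(b))) for a, b in pairs)
    return total / len(pairs)

docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "repair"],
    ["cat", "jungle", "river"],
]
related = av_pmi(["car", "engine"], docs)    # terms always co-occur
unrelated = av_pmi(["jaguar", "car"], docs)  # co-occur only as often as chance
```

Terms that co-occur more often than independence would predict produce a positive value; terms that co-occur at chance level produce zero.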


Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors.

This is because the information available to them is much more general and much more sparse.

Nevertheless, two pre-retrieval predictors achieve performance equivalent to post-retrieval predictors.

Both approaches are computationally intensive:

The technique proposed by He et al. requires clustering.

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents.

Hence, neither technique is efficient enough to be useful for large collections.



Query Topic

Broder et al. presented a method for classifying short, rare queries into a taxonomy of 6000 categories.

This is an important problem because rare or infrequent queries are approximately half of all queries.

Training data: a commercial taxonomy containing many documents assigned to each category.

They ran the search results of a query through the classifier and then used a voting algorithm over the resulting classes to classify the query.
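The classify-then-vote idea can be sketched as below. The plain majority vote, the function name `classify_query`, and the keyword-based `toy_classifier` are assumptions made for illustration; the actual voting scheme and classifier of the cited work are not specified here.

```python
from collections import Counter

def classify_query(results, classify_document):
    """Assign the query the class that most of its search results receive."""
    votes = Counter(classify_document(doc) for doc in results)
    return votes.most_common(1)[0][0]

def toy_classifier(doc):
    # Hypothetical stand-in for a classifier trained on the taxonomy documents.
    return "Autos" if "car" in doc else "Animals"

# Made-up search results for a short query such as "jaguar".
results = ["used car prices", "jaguar car dealer", "jaguar habitat"]
label = classify_query(results, toy_classifier)
```

The design point is that the short query itself is never classified directly; its longer result documents carry the evidence.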


Query Topic

Ambiguous queries are those queries that can have two or more distinct meanings.

For instance, a query can be related to politics but also to news.

There are few papers on ambiguity detection.

Song et al. estimated that about 16% of all queries are ambiguous.


Query Sessions and Missions

One important problem when analyzing queries is determining user query sessions.

Early work used fixed time limits to define sessions, but this definition presents two problems:

the session might be longer than the fixed limit

the user might pursue more than one goal in the session

Hence it is good to distinguish:

time-based sessions: queries of a user within the same session

missions: sequences of queries with the same goal

Research missions: an additional problem is that missions can span more than one session.


Query Session Boundaries

One alternative to determine sessions more accurately is to establish a maximum inactivity time.

Different authors have found different thresholds, ranging from five to sixty minutes.

Query missions are sequences of reformulated queries that express the same need.

The detection of missions is a hard problem and has been approached by several researchers.
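The maximum-inactivity rule for session boundaries is straightforward to sketch. The example assumes one user's query timestamps in seconds, sorted in ascending order, and a 30-minute default threshold picked arbitrarily from the five-to-sixty-minute range; the function name `split_sessions` is hypothetical. Detecting missions would need query content and is not attempted here.

```python
def split_sessions(timestamps, max_gap_seconds=30 * 60):
    """Group sorted query timestamps into sessions separated by long inactivity."""
    sessions = []
    for ts in timestamps:
        if sessions and ts - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity exceeded: start a new session
    return sessions

# Three queries in quick succession, then two more after a long pause.
log = [0, 60, 120, 4000, 4100]
sessions = split_sessions(log)
one_session = split_sessions(log, max_gap_seconds=2 * 60 * 60)
```

Changing the threshold changes the segmentation, which is one reason different studies report different session statistics.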


Query Properties

Query Difficulty


Query Difficulty

Another important characteristic of queries is their intrinsic difficulty.

For example, one-word queries are simpler than phrase queries.

There are two different ways to measure difficulty:

Post-retrieval prediction mechanisms run the query and analyze its answer set.

Pre-retrieval mechanisms evaluate the difficulty without executing the query.


Post-retrieval Algorithms

Post-retrieval algorithms are more diverse, as they have more information at hand.

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents.









Query Properties

Query Difficulty


Query Difficulty

Another important characteristic of queries is their intrinsic difficulty

For example, one-word queries are simpler than phrase queries

There are two different ways to measure difficulty:

  • Post-retrieval prediction mechanisms run the query and analyze its answer set

  • Pre-retrieval mechanisms evaluate the difficulty without executing the query


Post-retrieval Algorithms

Post-retrieval algorithms are more diverse than pre-retrieval ones, as they have more information at hand

Clarity Score relies on the difference between the language models of the collection and of the top retrieved documents

The intuition is that the top-ranked results of an unambiguous query will be topically cohesive

The term distribution of the results of an ambiguous query is assumed to be more similar to the collection distribution
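As a rough illustration of the idea behind Clarity Score, the sketch below compares a unigram model of the top retrieved documents with a unigram model of the whole collection. It assumes the "difference" is measured as a Kullback-Leibler divergence in bits, that both models are unsmoothed maximum likelihood estimates, and that the top documents are drawn from the collection (so no collection probability is zero). The function names and the toy collection are hypothetical, and the published method weights documents by their relevance to the query, which is omitted here.

```python
import math
from collections import Counter


def language_model(docs):
    """Unsmoothed maximum likelihood unigram model over tokenized documents."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def clarity_score(top_docs, collection):
    """KL divergence (bits) between the top-document model and the collection model.

    Assumption: every term of top_docs also occurs in collection.
    """
    p_top = language_model(top_docs)
    p_coll = language_model(collection)
    return sum(p * math.log2(p / p_coll[term]) for term, p in p_top.items())


# Hypothetical toy collection: two documents about cars, two about animals.
collection = [
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "speed"],
    ["cat", "jungle", "prey"],
]

focused = clarity_score([collection[0], collection[2]], collection)
unfocused = clarity_score(collection, collection)
print(focused, unfocused)
```

A topically cohesive result set (the two car documents) diverges from the collection model, while a result set that mirrors the collection scores zero, which matches the intuition stated on the slide.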


Post-retrieval Algorithms

Yom-Tov et al. compared the ranked list of the original query with the ranked lists retrieved for the query's individual terms

Intuition: for well-performing queries, the results do not change considerably if only a subset of the query terms is used

Aslam et al.: a query is considered to be difficult if different ranking functions retrieve diverse ranked lists

If the overlap between the top-ranked documents is large across all ranked lists, the query is deemed to be easy
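The overlap criterion above can be sketched with a plain set overlap of the top-k documents, averaged over every pair of ranking functions. This is only an illustration: the choice of set overlap as the agreement measure, the function names and the toy rankings are assumptions, not the measure used in the cited work.

```python
from itertools import combinations


def overlap_at_k(list_a, list_b, k):
    """Fraction of the top-k documents shared by two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k


def mean_pairwise_overlap(ranked_lists, k):
    """Average top-k overlap over all pairs of ranked lists (needs at least two lists)."""
    pairs = list(combinations(ranked_lists, 2))
    return sum(overlap_at_k(a, b, k) for a, b in pairs) / len(pairs)


# Hypothetical outputs of three ranking functions for two queries.
agreeing = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
diverging = [["d1", "d2", "d3"], ["d4", "d5", "d1"], ["d6", "d7", "d8"]]

easy_score = mean_pairwise_overlap(agreeing, k=3)
hard_score = mean_pairwise_overlap(diverging, k=3)
print(easy_score, hard_score)
```

Under this sketch a high average overlap suggests an easy query and a low one suggests a difficult query; where to place the threshold is left open.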


Post-retrieval Algorithms

Zhou and Croft investigated two approaches: Weighted Information Gain and Query Feedback

Weighted Information Gain measures the change in information about the quality of retrieval between

  • an imaginary state in which only an average document is retrieved (estimated by the collection model), and

  • a posterior state where the actual search results are observed


Post-retrieval Algorithms

Query Feedback frames query difficulty prediction as a communication channel problem

The input is the query Q, the channel is the retrieval system, and the ranked list L is the noisy output of the channel

From L a new query Q′ is generated, and a second ranking L′ is retrieved with Q′

The overlap between L and L′ is used as the prediction score

Hauff et al. provide an evaluation of these techniques and propose a new technique named Improved Clarity
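The Query Feedback loop (Q, then L, then Q′, then L′, then overlap) can be sketched end to end on a toy system. Everything concrete here is an assumption made for illustration: the retrieval function simply counts query-term occurrences, Q′ is built from the most frequent terms of the documents in L, and the overlap is the shared fraction of the top k. The slides do not specify how Q′ is generated.

```python
from collections import Counter


def search(query_terms, docs, k):
    """Toy retrieval system: rank documents by the number of query-term occurrences."""
    scored = [(sum(doc.count(t) for t in query_terms), doc_id)
              for doc_id, doc in docs.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


def feedback_query(ranking, docs, n_terms):
    """Build Q' from the most frequent terms of the documents in L (an assumption)."""
    counts = Counter(t for doc_id in ranking for t in docs[doc_id])
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ordered[:n_terms]]


def query_feedback_score(query_terms, docs, k=2, n_terms=2):
    """Overlap between L (retrieved with Q) and L' (retrieved with Q')."""
    first = search(query_terms, docs, k)               # L
    if not first:
        return 0.0
    new_query = feedback_query(first, docs, n_terms)   # Q'
    second = search(new_query, docs, k)                # L'
    return len(set(first) & set(second)) / k


# Hypothetical collection in which "jaguar" is ambiguous (car vs. animal).
docs = {
    "d1": ["jaguar", "car", "engine"],
    "d2": ["jaguar", "cat", "jungle"],
    "d3": ["car", "engine", "car"],
    "d4": ["engine", "car", "speed"],
}

clear = query_feedback_score(["car", "engine"], docs)
ambiguous = query_feedback_score(["jaguar"], docs)
print(clear, ambiguous)
```

In this toy setting the focused query reproduces its own ranking, while the ambiguous query drifts toward one sense and loses part of its original result list.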


Pre-retrieval Algorithms

Pre-retrieval algorithms must rely on the collection statistics of the query terms to predict query difficulty

For example, they take into account either

  • the frequencies of the query terms in the collection, as in Averaged IDF or Simplified Clarity Score, or

  • the co-occurrence of query terms in the collection, as in Averaged Pointwise Mutual Information (PMI)

Kwok et al. proposed measures based on the inverse document frequency and the collection frequency

Queries with low-frequency terms are predicted to achieve better performance
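A frequency-based predictor such as Averaged IDF can be sketched in a few lines. The sketch assumes the common form idf = log2(N / df), where N is the number of documents and df the number of documents containing the term, and it skips query terms that do not occur in the collection; both choices, the function name and the toy collection are assumptions.

```python
import math


def averaged_idf(query_terms, docs):
    """Mean of log2(N / df) over the query terms that occur in the collection."""
    n_docs = len(docs)
    idfs = []
    for term in query_terms:
        df = sum(1 for doc in docs if term in doc)
        if df:  # assumption: terms absent from the collection are ignored
            idfs.append(math.log2(n_docs / df))
    return sum(idfs) / len(idfs) if idfs else 0.0


# Hypothetical collection, each document given as a set of terms.
docs = [
    {"the", "solar", "panel"},
    {"the", "solar", "eclipse"},
    {"the", "moon"},
    {"the", "tide"},
]

common = averaged_idf(["the"], docs)
rare = averaged_idf(["solar", "eclipse"], docs)
print(common, rare)
```

The query made of rarer terms receives the higher score, in line with the prediction that queries with low-frequency terms perform better.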


Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies

$$\mathrm{SCS}(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2\left(\frac{P_{ml}(k_i|q)}{P_{coll}(k_i)}\right)$$

where

  • $P_{ml}(k_i|q)$ is the maximum likelihood estimate of term $k_i$ given query $q$

  • $P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
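The SCS formula translates directly into code. In this sketch the maximum likelihood estimate of a term given the query is taken to be its relative frequency in the query, and every query term is assumed to occur in the collection (otherwise the ratio is undefined). The function name and the collection counts are made up for illustration.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_counts, collection_size):
    """Sum over distinct query terms of P_ml * log2(P_ml / P_coll).

    Assumption: every query term has a non-zero count in collection_counts.
    """
    query_counts = Counter(query_terms)
    score = 0.0
    for term, qtf in query_counts.items():
        p_ml = qtf / len(query_terms)                         # relative frequency in the query
        p_coll = collection_counts[term] / collection_size    # relative frequency in the collection
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical term counts for a collection of 1024 term occurrences.
collection_counts = {"the": 256, "solar": 16, "eclipse": 4}
collection_size = 1024

specific = simplified_clarity_score(["solar", "eclipse"], collection_counts, collection_size)
generic = simplified_clarity_score(["the", "solar"], collection_counts, collection_size)
print(specific, generic)
```

The query built from rarer terms obtains the higher score because its term distribution departs more from the collection distribution.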


Pre-retrieval Algorithms

Averaged PMI measures the pointwise mutual information between two query terms in the collection, averaged over all pairs of query terms

$$\mathrm{AvPMI}(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2\left(\frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i)\,P_{coll}(k_\ell)}\right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
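A small sketch of Averaged PMI follows. It makes several assumptions that the slide leaves open: all probabilities are estimated at the document level (fraction of documents containing the term or the pair), the normalizer is the number of distinct query-term pairs, and pairs that never co-occur are skipped because their logarithm is undefined. The function name and the toy collection are hypothetical.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average log2(P(a, b) / (P(a) * P(b))) over pairs of distinct query terms."""
    n_docs = len(docs)

    def prob(*terms):
        # Fraction of documents that contain all the given terms.
        return sum(1 for doc in docs if all(t in doc for t in terms)) / n_docs

    values = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        joint = prob(a, b)
        if joint > 0:  # assumption: pairs that never co-occur are skipped
            values.append(math.log2(joint / (prob(a) * prob(b))))
    return sum(values) / len(values) if values else 0.0


# Hypothetical collection, each document given as a set of terms.
docs = [
    {"solar", "panel", "new"},
    {"solar", "panel"},
    {"moon", "tide", "new"},
    {"moon", "river"},
]

related = averaged_pmi(["solar", "panel"], docs)
unrelated = averaged_pmi(["solar", "new"], docs)
mixed = averaged_pmi(["solar", "panel", "new"], docs)
print(related, unrelated, mixed)
```

Terms that always appear together score above zero, terms that co-occur only as often as chance predicts score zero, and a three-term query averages over its three pairs.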


Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors

This is because the information available to them is much more general and much more sparse

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors

Both approaches are computationally intensive:

  • the technique proposed by He et al. requires clustering

  • the technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections






  • Queries Languages amp Properties
  • Query Languages
  • Query Languages
  • Query Languages
  • Query Languages
  • Keyword Based Querying
  • Word Queries
  • Word Queries
  • Word Queries
  • Context Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Boolean Queries
  • Pattern Matching
  • Pattern Matching
  • Natural Language
  • Structural Queries
  • Structural Queries
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Fixed Structure
  • Query Protocols
  • Query Protocols
  • Query Protocols
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • Characterizing Web Queries
  • User Search Behavior
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Intent
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Topic
  • Query Sessions and Missions
  • Query Session Boundaries
  • Query Difficulty
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Post-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms
  • Pre-retrieval Algorithms

Post-retrieval AlgorithmsQuery Feedback frames query prediction as acommunication channel problem

The input is query Q the channel is the retrieval system and theranked list L is the noisy output of the channel

From L a new query Qprime is generated a second ranking Lprime isretrieved with Qprime

The overlap between L and Lprime is used as prediction score

Hauff et al provides an evaluation of these techniquesand propose a new technique named Improved Clarity

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 63

Pre-retrieval AlgorithmsPre-retrieval algorithms must rely on the collectionstatistics of the query terms to predict query difficulty

For example they either take into account

the frequencies of the query terms in the collection such asAveraged IDF or Simplified Clarity Score or

the co-occurrence of query terms in the collection such asAveraged Pointwise Mutual Information (PMI)

Kwok et al proposed measures based on the inversedocument frequency and the collection frequency

Queries with low frequency terms are predicted toachieve a better performance

Queries Languages amp Properties Modern Information Retrieval Addison Wesley 2010 ndash p 64

Pre-retrieval Algorithms

He et al. evaluated a number of algorithms, including Query Scope and Simplified Clarity Score

Query Scope uses the number of documents in the collection that contain at least one of the query terms

Simplified Clarity Score (SCS) relies on term frequencies

$$SCS(q) = \sum_{k_i \in q} P_{ml}(k_i|q) \times \log_2 \left( \frac{P_{ml}(k_i|q)}{P_{coll}(k_i)} \right)$$

where

$P_{ml}(k_i|q)$ is the maximum likelihood estimate of the probability of term $k_i$ given query $q$

$P_{coll}(k_i)$ is the term count of $k_i$ in the collection divided by the total number of terms in the collection
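
The SCS formula can be sketched directly from these definitions. The following is a minimal illustration, not a reference implementation: the maximum likelihood estimate is assumed to be the count of the term in the query divided by the query length, terms absent from the collection are skipped to avoid a division by zero, and the function name and toy counts are hypothetical.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, coll_term_count, coll_total_terms):
    """SCS(q) = sum over query terms of P_ml * log2(P_ml / P_coll)."""
    query_counts = Counter(query_terms)
    query_len = len(query_terms)
    score = 0.0
    for term, count in query_counts.items():
        # Assumption: P_ml(k|q) = occurrences of k in q / length of q.
        p_ml = count / query_len
        # P_coll(k) = term count in the collection / total number of terms.
        p_coll = coll_term_count.get(term, 0) / coll_total_terms
        if p_coll == 0:
            continue  # assumption: ignore terms unseen in the collection
        score += p_ml * math.log2(p_ml / p_coll)
    return score


# Hypothetical collection with 1024 term occurrences in total.
coll_term_count = {"jaguar": 1, "speed": 4}
scs = simplified_clarity_score(["jaguar", "speed"], coll_term_count, 1024)
print(scs)
```

Here each query term has an estimated probability of 0.5, the two logarithms are 9 and 7, and the score is therefore 8.0.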


Pre-retrieval Algorithms

Averaged PMI measures the average pointwise mutual information of pairs of query terms in the collection

$$AvPMI(q) = \frac{1}{|(k_i, k_\ell)|} \sum_{(k_i, k_\ell) \in q} \log_2 \left( \frac{P_{coll}(k_i, k_\ell)}{P_{coll}(k_i) \, P_{coll}(k_\ell)} \right)$$

where $P_{coll}(k_i, k_\ell)$ is the probability that terms $k_i$ and $k_\ell$ occur in the same document
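
A small sketch of the AvPMI computation follows. Several choices here are assumptions rather than part of the slides: all three probabilities are estimated as fractions of documents, pairs that never co-occur are skipped (their logarithm would be undefined), and the function name and toy collection are hypothetical.

```python
import math
from itertools import combinations


def averaged_pmi(query_terms, docs):
    """Average log2 PMI over all pairs of distinct query terms.

    Assumption: P_coll(k) is the fraction of documents containing k and
    P_coll(k, l) the fraction of documents containing both terms.
    """
    num_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]

    def prob(*terms):
        hits = sum(1 for d in doc_sets if all(t in d for t in terms))
        return hits / num_docs

    pmis = []
    for a, b in combinations(sorted(set(query_terms)), 2):
        p_ab = prob(a, b)
        if p_ab == 0:
            continue  # assumption: skip pairs that never co-occur
        pmis.append(math.log2(p_ab / (prob(a) * prob(b))))
    return sum(pmis) / len(pmis) if pmis else 0.0


# Hypothetical collection of four tokenized documents.
docs = [
    ["query", "difficulty"],
    ["query", "difficulty"],
    ["ranking"],
    ["index"],
]
av_pmi = averaged_pmi(["query", "difficulty"], docs)
print(av_pmi)
```

Both terms occur in half of the documents and always together, so the single pair has a PMI of log2(0.5 / 0.25) = 1.0.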


Pre-retrieval Algorithms

Pre-retrieval predictors usually have lower accuracy than post-retrieval predictors

This is because the information available to them is much more general and much sparser

Nevertheless, two pre-retrieval predictors achieve performance equivalent to that of post-retrieval predictors

Both approaches are computationally intensive

The technique proposed by He et al. requires clustering

The technique proposed by Zhao et al. requires determining the tf-idf distribution of all documents

As a result, neither technique is efficient enough to be useful for large collections

