Server Ranking for Distributed Text Retrieval Systems on the Internet (Yuwono and Lee) presented by Travis Emmitt


Page 1

Server Ranking for Distributed Text Retrieval Systems on the Internet

(Yuwono and Lee)

presented by Travis Emmitt

Page 2

General Architecture

[Architecture diagram: collections Coll_1 through Coll_N each hold documents relevant to Query_A, Query_B, and/or Query_C; broker servers Broker_1 through Broker_M (clones, created when needed) sit between the users and the collections and need DF info from the collections; users User_X, User_Y, and User_Z submit Query_A, Query_B, and Query_C through the brokers.]

Page 3

Terminology

• cooperating autonomous index servers

• collection fusion problem

• collections = databases = sites

• broker servers =?= meta search engines
  – a collection of server-servers

• index servers = collection servers

• documents = information resources = texts

Page 4

More Terminology

• words
  – before stemming and stopping
  – example: { the, computer, computing, French }

• terms
  – after stemming and stopping
  – example: { comput, French }

• keywords
  – meaning varies depending upon context

Page 5

Subscripts

• We often see TFi,j and IDFj within the context of a single collection
  – In a multiple-collection environment, this notational shorthand can lead to ambiguity.
  – Should instead use TFh,i,j and IDFh,j

• h, i, and j are identifiers [possibly integers]
  – ch is a collection
  – doch,i is a document in collection ch
  – th,j is a term in collection ch

Page 6

More Terminology

• Nh = number of documents in collection ch

• Vh = vocabulary / set of all terms in ch

• Mh = number of terms in collection ch

– Mq = number of terms in query q

– Mh = |Vh|

Page 7

TFh,i,j = Term Frequency

• Definition: number of times term th,j occurs in document doch,i

• gGLOSS assumes TFh,i,j = 0 or avgTFh,j

– avgTFh,j = Sum_{i=1..Nh} (TFh,i,j) / Nh

– TFs are assumed to be identical for all documents in collection ch that contain one or more occurrences of term th,j
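A minimal Python sketch of TF and gGLOSS's avgTF, assuming each document is already a list of stemmed, stopped terms (the collection layout and function names are illustrative, not from the paper):

```python
from collections import Counter

def term_frequencies(doc_terms):
    """TFh,i,j: number of times each term occurs in one document."""
    return Counter(doc_terms)

def avg_term_frequency(collection, term):
    """avgTFh,j = Sum_{i=1..Nh} TFh,i,j / Nh, averaged over ALL docs in the collection."""
    n_h = len(collection)
    total = sum(term_frequencies(doc)[term] for doc in collection)
    return total / n_h

# Tiny illustrative collection c_h with N_h = 3 documents.
c_h = [["comput", "french", "comput"],
       ["french"],
       ["comput"]]
print(avg_term_frequency(c_h, "comput"))  # (2 + 0 + 1) / 3 = 1.0
```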

Page 8

TFh,i,max = Maximum Term Frequency

• th,i,max = the term occurring most frequently in document doch,i

• TFh,i,max = number of times that term th,i,max occurs in document doch,i

• Example: doch,i = “Cat cat dog cat cat”

– th,i,max = “cat”

– TFh,i,max = 4
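The same example as a sketch, assuming case folding so that "Cat" and "cat" count as one term (the helper name is made up for illustration):

```python
from collections import Counter

def max_term_frequency(doc_terms):
    """Return (th,i,max, TFh,i,max): the most frequent term in the document and its count."""
    return Counter(doc_terms).most_common(1)[0]

doc = "Cat cat dog cat cat".lower().split()   # case-folded, so "Cat" == "cat"
print(max_term_frequency(doc))                # ('cat', 4)
```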

Page 9

IDFh,j = Inverse Document Frequency

• DFh,j = document frequency

– number of docs in collection ch containing term th,j

• IDFh,j = 1 / DFh,j

– the literal interpretation of “inverse”

• IDFh,j = log (Nh / DFh,j)

– how it's actually used
  – a normalization technique

• Note: term th,j must appear in at least one document in the collection, or DFh,j will be 0 and IDFh,j will be undefined
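A sketch of DF and IDF under these definitions, with a guard mirroring the note about DF = 0 (the list-of-terms document representation and the function names are assumptions carried over from the earlier sketch, not from the paper):

```python
import math
from collections import Counter

def document_frequencies(collection):
    """DFh,j: number of documents in the collection containing each term."""
    df = Counter()
    for doc_terms in collection:
        df.update(set(doc_terms))   # count each term at most once per document
    return df

def idf(collection, term):
    """IDFh,j = log(Nh / DFh,j); undefined if the term occurs in no document."""
    df = document_frequencies(collection)[term]
    if df == 0:
        raise ValueError("term absent from the collection: DF = 0, IDF undefined")
    return math.log(len(collection) / df)
```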

Page 10

Wh,i,j(scheme) = Term Weight

• Definition: the “weight” assigned to term th,j in document doch,i by a weighting scheme

• Wq,j(scheme) = the weight assigned to term tq,j in query q by a weighting scheme
  – We drop one subscript because queries don't belong to collections, unless you consider the set of queries to be a collection in itself [no one seems to do this]

• Note: for single-term queries, the term weight alone might suffice to rank documents

Page 11

Wh,i,j(atn)

• "atn" is a code representing the choices made during a three-part calculation [a, t, n]

• X = (0.5 + 0.5 TFh,i,j/TFh,i,max) -- the TF part

• Y = log (Nh/DFh,j) -- the IDF part

• Wh,i,j(atn) = X * Y

• Note: TFh,i,max might be the maximum term frequency in doch,i with the added constraint that the max term must occur in the query. If so, then X is dependent upon query composition and must therefore wait until query time to be calculated.
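A hedged sketch of the atn weight, assuming the query-independent reading of TFh,i,max (the maximum over all terms in the document); n_docs and df stand for Nh and DFh,j and are assumed to come from the collection's index:

```python
import math
from collections import Counter

def atn_weight(doc_terms, term, n_docs, df):
    """Wh,i,j(atn) = (0.5 + 0.5 * TF/TFmax) * log(N / DF).

    TFmax is taken over all terms in the document; see the note above for
    the query-dependent variant, which would have to wait until query time.
    """
    counts = Counter(doc_terms)
    tf = counts[term]
    tf_max = max(counts.values())
    x = 0.5 + 0.5 * tf / tf_max   # the TF part
    y = math.log(n_docs / df)     # the IDF part
    return x * y
```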

Page 12

Wh,i,j(atc)

• X = (0.5 + 0.5 TFh,i,j/TFh,i,max) -- the TF part

• Y = log (Nh/DFh,j) -- the IDF part

• Z = sqrt( Sum_{k=1..Mh} Xk^2 * Yk^2 ) -- vector-length normalization

• Wh,i,j(atc) = X * Y / Z

• atc is atn with vector-length normalization
  – atc is better for comparing long documents
  – atn is better for comparing short documents, and is cheaper to calculate
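A sketch of atc reusing the atn pieces and dividing by the vector length; as a simplification it normalizes over the terms that actually occur in the document, and the df dictionary (term -> DFh,j) is an assumed input:

```python
import math
from collections import Counter

def atc_weights(doc_terms, n_docs, df):
    """Wh,i,j(atc) for every term in one document: the atn weights divided by
    the document's vector length Z (here summed over the terms present in the
    document, a simplification of the Mh-term sum above)."""
    counts = Counter(doc_terms)
    tf_max = max(counts.values())
    atn = {}
    for term, tf in counts.items():
        x = 0.5 + 0.5 * tf / tf_max       # the TF part
        y = math.log(n_docs / df[term])   # the IDF part
        atn[term] = x * y
    z = math.sqrt(sum(w * w for w in atn.values()))   # vector-length normalization
    return {term: w / z for term, w in atn.items()}
```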

Page 13

Query Time

• TFs, IDFs, and [possibly] Ws can be calculated prior to performing any queries.

• Queries are made up of one or more terms.
  – Some systems perceive queries as documents.
  – Others see them as sets of keywords.

• The job at query time is to determine how well each document/collection "matches" a query: we calculate a similarity score for each document/collection relative to the query.

Page 14

Sh,i,q(scheme) = Similarity Score

• Definition: estimated similarity of document doch,i to query q using a scheme

• Also called relevance score

• Sh,i,q(scheme) = Sum_{j=1..Mq} ( Wh,i,j(scheme) * Wq,j(scheme) ) -- Eq 1

• CVV assumes that Wq,j(scheme) = 1 for all terms tj that occur in query q, so:

  – Sh,i,q(atn) = Sum_{j=1..Mq} ( Wh,i,j(atn) ) -- Eq 3
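A sketch of Eq 1, which reduces to Eq 3 when the query weights are omitted (all taken as 1), as CVV assumes; doc_weights is assumed to be a dict of precomputed W values for one document:

```python
def similarity(doc_weights, query_terms, query_weights=None):
    """Sh,i,q = sum over query terms of W_doc * W_query (Eq 1).

    With query_weights omitted, every query term gets weight 1, which is the
    CVV assumption and reduces Eq 1 to Eq 3.
    """
    score = 0.0
    for term in query_terms:
        w_q = 1.0 if query_weights is None else query_weights.get(term, 0.0)
        score += doc_weights.get(term, 0.0) * w_q
    return score
```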

Page 15

Ranking and Returning the "Best" Documents

• Rank documents in descending order of similarity scores to the query.

• One method: get all docs with similarity scores above a specified threshold theta

• CVV retrieves the top-H+ documents
  – Include all documents tied with the H-th best document
  – Assume the H-th best doc's similarity score > 0
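A sketch of top-H+ selection over a dict of similarity scores, keeping ties with the H-th best document and dropping non-positive scores (the function name is illustrative):

```python
def top_h_plus(scores, h):
    """Return the top-H documents plus any documents tied with the H-th best,
    keeping only documents whose similarity score is > 0."""
    ranked = sorted(((doc, s) for doc, s in scores.items() if s > 0),
                    key=lambda pair: pair[1], reverse=True)
    if len(ranked) <= h:
        return ranked
    cutoff = ranked[h - 1][1]                     # score of the H-th best document
    return [pair for pair in ranked if pair[1] >= cutoff]
```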

Page 16

Multiple Collection Search

• Also called collection selection

• In CVV, brokers need access to DFs
  – must be centralized, periodically updated
  – all IDFs then provided to collection servers

Why?

1) “N is the number of texts in the database” [page 2]

2) “We refer to index servers as collection servers, as each of them can be viewed as a database carrying a collection of documents.” [page 2]

3) N and DF are both particular to a collection, so what extra-collection information is needed in Equation 3?

Page 17

CVV = Cue-Validity Variance

• Also called the CVV ranking method

• Goodness can be derived completely from DFi,j and Ni

Page 18

CVV Terminology

• C = set of collections in the system
• |C| = number of collections in the system

• Ni = number of documents in collection ci

• DFi,j = # of times term tj occurs in collection ci, or # of documents in ci containing term tj

• CVi,j = cue-validity of term tj for collection ci

• CVVj = cue-validity variance of term tj across collections

• Gi,q = goodness of collection ci to query q

Page 19

CVV: Calculation

• A = DFi,j / Ni

• B = Sum_{k=1..|C|, k!=i} (DFk,j) / Sum_{k=1..|C|, k!=i} (Nk)   ?=   Sum_{k=1..|C|, k!=i} (DFk,j / Nk)

• CVi,j = A / (A + B)

• avgCVj = Sum_{i=1..|C|} (CVi,j) / |C|

• CVVj = Sum_{i=1..|C|} (CVi,j - avgCVj)^2 / |C|

• Gi,q = Sum_{j=1..|C|} (CVVj * DFi,j)

I assume that’s what they meant (that M = |C|)
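A sketch of the whole goodness calculation from per-collection DFs and sizes only, following the formulas above; the inputs df[i][term] and n[i] and the first form of B (ratio of sums) are assumptions made for illustration:

```python
def goodness(df, n, query_terms):
    """Gi,q for every collection, computed from DFs and collection sizes only.

    CVi,j = A / (A + B), with A = DFi,j / Ni and B = (sum of the other DFs) / (sum of the other Ns)
    CVVj  = Sum_i (CVi,j - avgCVj)^2 / |C|
    Gi,q  = sum over query terms j of CVVj * DFi,j
    """
    colls = list(df)                      # collection ids
    c = len(colls)
    g = {i: 0.0 for i in colls}
    for term in query_terms:
        cv = {}
        for i in colls:
            a = df[i].get(term, 0) / n[i]
            others = [k for k in colls if k != i]
            b = sum(df[k].get(term, 0) for k in others) / sum(n[k] for k in others)
            cv[i] = a / (a + b) if (a + b) > 0 else 0.0
        avg_cv = sum(cv.values()) / c
        cvv = sum((cv[i] - avg_cv) ** 2 for i in colls) / c
        for i in colls:
            g[i] += cvv * df[i].get(term, 0)
    return g

# Illustrative use: two collections, query "cat dog".
df = {"Coll_1": {"cat": 40, "dog": 5}, "Coll_2": {"cat": 5, "dog": 30}}
n  = {"Coll_1": 100, "Coll_2": 100}
print(goodness(df, n, ["cat", "dog"]))
```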

Page 20

Goodness

• ...of a collection relative to a query

• Denoted Gi,q where i is a collection id, q is a query id

• Gi,q is a sum of scores, over all terms in the query

• Each score represents how well term qj characterizes collection ci [ i is a collection id, j is a term id ]

• Gi,q = Sum_{j=1..M} (CVVj * DFi,j)

• The collection with the highest Goodness is the “best” [most relevant] collection for this query

Page 21

Goodness: Example

• Query_A = “cat dog”

• q1 = cat, q2 = dog, M = |q| = 2

• You can look at this as [with user-friendly subscripts]:
  GColl_1,Query_A = scoreColl_1,cat + scoreColl_1,dog

GColl_2,Query_A = scoreColl_2,cat + scoreColl_2,dog

...

• Note: The authors overload the identifier q. At times it represents a query id [see Equation 1]. At other times, it represents a set [bag?] of terms: {qi, i from 1 to M}.

Page 22

Query Term Weights

• What if Query_A = "cat cat dog"?
  – Do we allow this? Should we weigh cat more heavily than dog? If so, how?

• Example: scoreColl_1,cat = 10, scoreColl_1,dog = 5
           scoreColl_2,cat = 5,  scoreColl_2,dog = 11
  – Intuitively, Coll_1 is more relevant to Query_A

• Scores might be computed prior to processing a query
  – get all collections' scores for all terms in the vocabulary
  – add the appropriate pre-computed scores when given a query

Page 23

QTW: CVV Assumptions

• The authors are concerned primarily with Internet queries [unlike us].

• They assume [based on their observations of users’ query tendencies] that terms appear at most once in a query.

• Their design doesn't support query term weights; it only cares whether a term is present in the query.

• Their design cannot easily be used to "find me documents like this one".

Page 24

QTW: Approach #1

• Approach #1 : q1=cat, q2=dog

– Ignore duplicates.
– Results in a "binary term vector".
– GColl_1,Query_A = 10 + 5 = 15
  GColl_2,Query_A = 5 + 11 = 16 -- top guess
– Here their algorithm considers Coll_2 more relevant than Coll_1, which runs counter to our intuition.

Page 25

QTW: Approach #2

• Approach #2 : q1=cat, q2=cat, q3=dog

– You need to make q a bag [allows duplicate elements] instead of a set [doesn't allow dups]
– GColl_1,Query_A = 10 + 10 + 5 = 25 -- top guess
  GColl_2,Query_A = 5 + 5 + 11 = 21
– Results in the "correct" answer.
– Easy to implement once you have a bag set up.
– However, primitive brokers will have to calculate [or locate, if pre-calculated] cat's scores twice.

Page 26

QTW: Approach #3

• Approach #3 : q1=cat, q2=dog, w1=2, w2=1

– The “true” term vector approach.

– GColl_1,Query_A = 10*2 + 5*1 = 25 -- top guess
  GColl_2,Query_A = 5*2 + 11*1 = 21

– Results in the "correct" answer.
– Don't need to calculate scores multiple times.
– If query term weights tend to be:
  • > 1 -- you save space: [cat, 50] instead of fifty "cat cat ..."
  • almost all 1 -- less efficient
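A small sketch contrasting the three approaches on the running example; the score table is the one from the earlier slide, and the function names are illustrative:

```python
from collections import Counter

# Per-collection scores for each term, from the example on the previous slides.
score = {("Coll_1", "cat"): 10, ("Coll_1", "dog"): 5,
         ("Coll_2", "cat"): 5,  ("Coll_2", "dog"): 11}

def g_set(coll, query):           # Approach #1: ignore duplicate query terms
    return sum(score[(coll, t)] for t in set(query))

def g_bag(coll, query):           # Approach #2: keep duplicates (q is a bag)
    return sum(score[(coll, t)] for t in query)

def g_weighted(coll, weights):    # Approach #3: explicit query term weights
    return sum(score[(coll, t)] * w for t, w in weights.items())

query = ["cat", "cat", "dog"]                 # Query_A = "cat cat dog"
weights = Counter(query)                      # {'cat': 2, 'dog': 1}
for coll in ("Coll_1", "Coll_2"):
    print(coll, g_set(coll, query), g_bag(coll, query), g_weighted(coll, weights))
# Coll_1: 15 25 25   Coll_2: 16 21 21  -- only Approach #1 prefers Coll_2
```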

Page 27

QTW: Approach #3 (cont)

• Approach #3 -- the most TREC-friendly
  – TREC queries often have duplicate terms
  – Approach #3 results in "correct" answers and is more efficient than Approach #2

• #3 sometimes better for WWW search:
  – "Find me more docs like this" -- doc similarities
  – Iterative search engines can use query term weights to hone queries [example on next page]
  – Possibility of negative term weights [see example]

Page 28

QTW: Iterative Querying (example)

• Query_1: "travis(5) emmitt(5) football(5)"
  – results in lots of hits on Emmitt Smith, nothing on Travis
  – User tells the engine that "emmitt smith" is irrelevant
  – Engine adjusts each query term weight in the "black list" by -1, then performs a revised query:

• Query_2: “travis(5) emmitt(4) football(5) smith(-1)”

– Hopefully yields fewer Emmitt Smith hits and more Travis
– Repeat the cycle of user feedback, weight tweaking, and requerying until the user is satisfied [or gives up]

• Can’t do this easily without term weights
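A sketch of the weight adjustment step on this example; the black-list handling (subtract 1 from listed terms, add missing ones with a negative weight) is my reading of the slide, not a quoted algorithm:

```python
def revise_query(weights, black_list, penalty=1):
    """Subtract `penalty` from every black-listed term's weight; black-listed
    terms not yet in the query enter it with a negative weight."""
    revised = dict(weights)
    for term in black_list:
        revised[term] = revised.get(term, 0) - penalty
    return revised

query_1 = {"travis": 5, "emmitt": 5, "football": 5}
query_2 = revise_query(query_1, black_list=["emmitt", "smith"])
print(query_2)   # {'travis': 5, 'emmitt': 4, 'football': 5, 'smith': -1}
```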

Page 29

QTW: User Profiles

• Might also have user profiles:
  – Allison loves cats, hates XXX, likes football
  – Her profile: cats(+3), XXX(-3), football(+1)
  – Adjustments made to every query she issues.

• Issues: "wearing different hats", relying on keywords, wanting sensitivity to context:
  – "XXX" is especially bad when JPEGs are present
  – "XXX" is not bad when in source code: "XXX:=1;"

Page 30

QTW: Conclusion

• The bottom line is that query term weights can be useful, not just in a TREC scenario but in an Internet search scenario.

• CVV can probably be changed to support query term weights [might’ve already been]

• The QTW discussion was included mostly as a segue to interesting, advanced issues: iterative querying, user profiles, context.

Page 31

Query Forwarding

• Single-Cast approach
  – Get documents from the best collection only.
  – Fast and simple. No result merging.
  – Question: How often will this in fact suffice?

• Multi-Cast approach
  – Get documents from the best n collections.
  – Slower; requires result merging.
  – Desired if the best collection isn't complete.

Page 32

Result Merging

• local doc ranks -> global doc ranks

• ri,j = rank of document docj in collection ci

– Ambiguous when dealing with multiple queries and multiple similarity estimation schemes [which is what we do].

– Should actually be ri,j,q(scheme)

• cmin,q = collection w/ least similarity to query q

• Gmin,q = goodness score of cmin,q relative to query q

Page 33

Result Merging (cont)

• Di = estimated score distance between the documents at ranks x and x+1
  – Di = Gmin,q / (H * Gi,q)

• si,j = 1 - (ri,j - 1) * Di

– global relevance score of the jth-ranked doc in ci

– need to re-rank documents globally
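A sketch of the merging step under Assumptions 1 and 2, mapping each collection's local ranks to global scores via Di; the input layout (local_ranks, goodness, h) is assumed for illustration:

```python
def merge_results(local_ranks, goodness, h):
    """Map each collection's local ranking to global relevance scores.

    Di   = Gmin,q / (H * Gi,q)   -- estimated score gap between consecutive ranks
    si,j = 1 - (ri,j - 1) * Di   -- global score of the doc at local rank ri,j
    """
    g_min = min(goodness.values())            # goodness of cmin,q
    merged = []
    for i, docs in local_ranks.items():       # docs are listed best-first
        d_i = g_min / (h * goodness[i])
        for rank, doc in enumerate(docs, start=1):
            merged.append((doc, 1 - (rank - 1) * d_i))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Illustrative use: two collections, H = 3.
print(merge_results({"Coll_1": ["d11", "d12", "d13"],
                     "Coll_2": ["d21", "d22"]},
                    goodness={"Coll_1": 8.0, "Coll_2": 4.0}, h=3))
```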

Page 34

CVV Assumption #1

Assumption 1: The best document in collection ci is equally relevant to query q (has the same global score) as the best document in collection ck for any k != i and Gi,q, Gk,q > 0.

• Nitpick: if k=i, Gi,q= Gk,q, so no reason for k != i

Page 35

CVV Assumption #1: Motivation

• They don't want to require the same search algorithm at each site [collection server]. Sites will therefore tend to use different score scales, so you can't simply compare relevance scores directly.

• They want a “collection containing a few but highly relevant documents to contribute to the final result.”

Page 36

CVV Assumption #1: Critique

• What about collections with a few weak documents? Or a few redundant documents [that occur in other, “better” collections]?

• They omit collections with goodness scores less than half the highest goodness score
  – The best document could exist by itself in an otherwise lame collection. The overall Goodness of that collection might be lower than half the max (since doc scores are used).

Page 37

CVV Assumption #2

Assumption 2: The distance, in terms of absolute relevance score difference, between two consecutive document ranks in the result set of a collection is inversely proportional to the goodness score of the collection.

Page 38

CVV vs gGLOSS

• Their characterization of gGLOSS:
  – "a keyword based distributed database broker system"

– “relies on the weight-sum of every term in a collection.”

– assumes that within a collection ci all docs contain either 0 or avgTFi,j occurrences of term tj

– assumes document weight computed similarly in all collections

– Y = log(N^ / DF^j), where N^ and DF^j are "global" values

Page 39

Performance Comparisons

• Accuracy calculated from the cosine of the angle between the estimated goodness vector and a baseline goodness vector.
  – Based on the top H+
  – Independent of precision and recall.

• They of course say that CVV is best
  – gGLOSS appears much better than CORI (!)