1 Blog site search using resource selection 2008 ACM CIKM Advisor Dr. Koh Jia-Ling Speaker Chou-Bin Fan Date 2009.08.04


Blog site search using resource selection

2008 ACM CIKM

Advisor: Dr. Koh Jia-Ling

Speaker: Chou-Bin Fan

Date: 2009.08.04


Outline

• Introduction

• Resource selection techniques for blog site search

1. Global Representation

2. Query Generation Maximization

3. Pseudo-Cluster based Selection

• Experiments

• Customizing the search

• Conclusion


Introduction

• A blog site consists of many individual blog postings.

• Current blog search services focus on retrieving postings but there is also a need to identify relevant blog sites.

• Blog site search is similar to resource selection in distributed information retrieval, in that the target is to find relevant collections of documents.


Introduction

• In this paper, we focus on search techniques for complete blogs rather than postings.

• Since the term “blog search” often means “posting search”, we instead use the term “blog site search”.

• As an example of the difference between blog site and blog posting searches, consider the following two queries:

Q1: “Nikon D3 review”

Q2: “digital camera reviews”


Introduction

• Finding relevant blog sites can be regarded as selecting relevant collections from a number of collections, in that each blog site can be considered as a collection of postings.

• Thus, in this paper, we study how to apply resource selection techniques to blog site search and further suggest customized methods to improve retrieval performance.


Resource selection techniques for blog site search

• Resource selection in distributed information retrieval is used to select the most relevant collections from a large number of possible collections.

• We can employ existing resource selection techniques for blog site search.

• Our goal is to find relevant collections, i.e. blog sites, rather than relevant documents. Of course, we could use blog site search as a technique for improving posting search.


Resource selection techniques for blog site search - Global Representation

• One of the simplest approaches to resource selection treats a collection as a single, large document.

• For a blog site search, we can generate a virtual document for a blog site by concatenating all postings in a blog.

• This virtual document D_i for a blog site c_i can then be represented using a language model, and the query likelihood of the document for a query Q is used as the ranking function.
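The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the whitespace tokenizer, the toy blogs, and the choice of Dirichlet smoothing with mu = 10 are all assumptions made here (the slides only mention a μ parameter for Dirichlet smoothing in the training setup).

```python
import math
from collections import Counter

def global_representation_score(query, postings, collection_tf, collection_len, mu=10.0):
    """Log query likelihood of a blog site's virtual document.

    Hypothetical helper: Dirichlet smoothing with parameter mu is assumed.
    """
    # Concatenate all postings of the blog site into one virtual document.
    virtual_doc = [tok for post in postings for tok in post.lower().split()]
    tf = Counter(virtual_doc)
    doc_len = len(virtual_doc)
    score = 0.0
    for q in query.lower().split():
        p_coll = collection_tf.get(q, 0) / collection_len
        p = (tf[q] + mu * p_coll) / (doc_len + mu)
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score

# Toy data (invented for illustration).
blogs = {
    "camera_blog": ["nikon d3 review", "digital camera reviews and tips"],
    "diary_blog": ["went to the park today", "bought a camera yesterday"],
}
all_tokens = [t for posts in blogs.values() for p in posts for t in p.split()]
collection_tf = Counter(all_tokens)

scores = {
    blog: global_representation_score(
        "digital camera reviews", posts, collection_tf, len(all_tokens)
    )
    for blog, posts in blogs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Scores are kept in log space so that long queries do not underflow; the ranking is the same as with the raw product of term probabilities.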


Resource selection techniques for blog site search - Global Representation

• This technique has some problems. One of the problems is that the virtual document might be a mixture of various topics.

• We call this technique “global representation” and use it as the first baseline for our experiments.

q is a query term of query Q

tf_{q,D_i} is the number of times term q occurs in virtual document D_i

|D_i| is the length of virtual document D_i

cf_q is the number of times term q occurs in the entire collection

|C| is the length of the collection
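The ranking formula itself did not survive in the slide text. Given the symbols defined above and the Dirichlet smoothing parameter μ mentioned in the training setup, a query likelihood of the following form would be consistent with them. Treat it as a reconstruction under those assumptions, not as the slide's exact equation:

```latex
P(Q \mid D_i) = \prod_{q \in Q} \frac{tf_{q,D_i} + \mu \, \dfrac{cf_q}{|C|}}{|D_i| + \mu}
```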


Resource selection techniques for blog site search - Query Generation Maximization

• “Unified utility maximization” performs resource selection by maximizing a utility function.

• The utility function for the high-recall problem is defined as follows:

c_i is a collection, i.e. {d_i1, d_i2, ···}, where the second index 1, 2, ... numbers its documents

N_C is the total number of collections

ñ_i is the number of documents returned from collection c_i

I(c_i) is an indicator function (1 if c_i is selected and 0 otherwise)

σ is a selection vector, i.e. [I(c_1), I(c_2), ···, I(c_NC)]

R(d_ij) is an estimated probability of relevance of the returned document d_ij
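The utility function on this slide was an image and is missing from the text. One form consistent with the symbols listed above, offered as an assumption rather than the slide's exact equation, sums the estimated relevance of the documents returned from each selected collection:

```latex
U(\sigma) = \sum_{i=1}^{N_C} I(c_i) \sum_{j=1}^{\tilde{n}_i} R(d_{ij})
```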


Resource selection techniques for blog site search - Query Generation Maximization

• Our goal is to find a selection vector that maximizes the utility function under a limit on the number of selected collections.

• The problem is described as follows:

• Here N_σ is the predetermined number of collections to select.

• The optimal solution of this problem is to select the N_σ collections with the largest expected number of relevant documents.
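The selection rule stated in the last bullet can be sketched as follows. The function name and the toy relevance estimates are hypothetical; the sketch assumes the expected number of relevant documents of a collection is the sum of R(d_ij) over its returned documents.

```python
def select_collections(relevance, n_sigma):
    """Pick the n_sigma collections with the largest expected number of relevant documents.

    relevance: dict mapping a collection id to the list of R(d_ij) values
    of the documents returned from that collection (hypothetical input format).
    """
    expected = {c: sum(rs) for c, rs in relevance.items()}
    # Sort by expected relevant documents, breaking ties by collection id.
    ranked = sorted(expected, key=lambda c: (-expected[c], c))
    return ranked[:n_sigma]

# Toy estimates (invented for illustration).
relevance = {
    "c1": [0.9, 0.8, 0.1],
    "c2": [0.4],
    "c3": [0.5, 0.5, 0.5, 0.2],
}
selected = select_collections(relevance, 2)
```

Because the utility is a sum of independent per-collection terms, a simple sort is enough; no search over selection vectors is needed.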


Resource selection techniques for blog site search - Query Generation Maximization

• In order to apply this method to blog site search, we simplify the process as follows.

• Build an index of postings ignoring which blog site the postings are from.

• Since we already know statistics of each collection, we can directly translate the query likelihood score to the probability of relevance of the document R(d_ij) for a given query without any estimation process.

where P(Q|d_ij) is the query likelihood of the document d_ij for the query Q.


Resource selection techniques for blog site search - Query Generation Maximization

• In this case, the optimal solution is to select the N_σ collections with the highest expected generation of the query.

• We derive a ranking function from this maximization.

• Simply sum the query likelihood scores of postings from the same blog site in the ranked list returned from the index.
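The summing step in the last bullet can be sketched like this. The function name, the input format (pairs of blog id and query likelihood), and the scores are assumptions for illustration only.

```python
from collections import defaultdict

def rank_by_query_generation(ranked_postings):
    """Sum posting-level query likelihoods per blog site and rank the sites.

    ranked_postings: list of (blog_id, query_likelihood) pairs, as they might
    come back from an index built over all postings (hypothetical format).
    """
    totals = defaultdict(float)
    for blog_id, likelihood in ranked_postings:
        totals[blog_id] += likelihood
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy ranked list (invented scores).
ranked_postings = [
    ("blogA", 0.04),
    ("blogB", 0.03),
    ("blogA", 0.02),
    ("blogC", 0.01),
    ("blogB", 0.005),
]
site_ranking = rank_by_query_generation(ranked_postings)
```

Note that a site with many moderately scored postings can outrank a site with one strong posting, since the scores are added rather than averaged.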


Resource selection techniques for blog site search - Pseudo-Cluster based Selection

• Distributed information retrieval using clustering is very effective because clustering redistributes documents in collections and makes topic-based sub-collections. Our goal is not to find relevant documents using resource selection but to find the resources themselves.

• We create “pseudo-clusters” by ranking blog postings and then grouping highly-ranked postings from the same blog. To represent the pseudo-clusters, we borrow a method from cluster-based retrieval.

• One of the biggest problems is that the representation of a cluster can be biased by a few documents in the cluster.


Resource selection techniques for blog site search - Pseudo-Cluster based Selection

• To avoid such a problem, we customize a new representation method. This method expresses the probability distribution of words over clusters using a geometric mean as follows:

• We can easily compute the query likelihood of blog site c_i as the geometric mean of the query likelihoods of the postings of blog site c_i in the ranked list (under a unigram assumption) as follows.

w is a word

g is a cluster

d_j is a document in cluster g

N_g is the number of documents in cluster g
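The geometric-mean scoring described above can be sketched as follows. This is an illustration under stated assumptions: the function name, the cutoff k on the number of postings kept per blog, and the toy scores are all hypothetical, and every likelihood is assumed to be strictly positive (as smoothed language models give).

```python
import math
from collections import defaultdict

def pseudo_cluster_scores(ranked_postings, k=5):
    """Geometric mean of posting query likelihoods per blog site.

    ranked_postings: list of (blog_id, query_likelihood) pairs sorted by
    decreasing likelihood (hypothetical input format).
    k: assumed cap on how many top postings form a site's pseudo-cluster.
    """
    clusters = defaultdict(list)
    for blog_id, likelihood in ranked_postings:
        if len(clusters[blog_id]) < k:
            clusters[blog_id].append(likelihood)
    # Compute the geometric mean in log space to avoid underflow.
    return {
        blog: math.exp(sum(math.log(p) for p in ps) / len(ps))
        for blog, ps in clusters.items()
    }

# Toy ranked list (invented scores, sorted by decreasing likelihood).
ranked_postings = [
    ("blogA", 0.04),
    ("blogB", 0.03),
    ("blogB", 0.03),
    ("blogA", 0.01),
]
scores = pseudo_cluster_scores(ranked_postings, k=2)
```

In this toy case blogB, whose postings are consistently good, scores above blogA, whose single strong posting is pulled down by a weak one. That is the sense in which a geometric mean resists bias from a few documents.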


Resource selection techniques for blog site search - Pseudo-Cluster based Selection

Unfair!! Fix to: [the two formulas shown on this slide were not preserved in the text]


Experiments - Design

• We ran experiments with the three resource selection techniques.

• For global representation, we built an index of blog sites after concatenating all postings from the same blog site into one document. We used the query likelihood retrieval model as the ranking method for the global representation.

• Query generation maximization and pseudo-cluster selection require an initial retrieval. We built an index from all postings and used the query likelihood retrieval model for the initial run.


Experiments - Training

• We performed an exhaustive grid search to find optimal parameters for each technique, e.g. the μ parameter for Dirichlet smoothing.

• We used normalized discounted cumulative gain (NDCG), mean average precision (MAP) and precision at rank 10 (P@10) as the evaluation measures.
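Two of the measures named above can be sketched with their standard definitions. The function names and the toy ranking are made up here, binary relevance is assumed, and NDCG is left out to keep the sketch short. MAP is the mean of the average precision over all queries.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    top = ranked[:k]
    return sum(1 for item in top if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy ranking of blog sites for one query (invented).
ranked = ["b1", "b2", "b3", "b4"]
relevant = {"b1", "b3"}

p_at_2 = precision_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
```

Here one of the top two results is relevant, so P@2 is 0.5, and the average precision is (1/1 + 2/3) / 2.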


Experiments - Retrieval Performance

• Table 2 shows that the two baselines, global representation and query generation maximization, performed similarly. Pseudo-cluster selection significantly outperformed both.

• In a practical sense, query generation maximization and pseudo-cluster selection have an advantage over global representation.


Customizing the search

• Blog site search involves somewhat different strategies compared to resource selection due to specific features of blog sites.

• For better resource selection, it is desirable to choose collections which include a greater number of relevant documents.

• We discuss which customizations may be appropriate by first introducing several types of blog sites.


Customizing the search - Types of Blog Sites

• We classified blog sites into three types based on how they are managed and the degree of diversity of the topics covered.

• Type I is the diary type of blog. In this type, a blogger usually posts descriptions of their daily life. It is rare for such a blog site to be regularly updated with further postings on similar topics.

• Type II is the news blog. Documents covering a large number of topics are posted, and many of these blogs are managed by an organization or a company. Many general Web news sites also contain feed links for their subscribers. These need to be kept out of the results.


Customizing the search - Types of Blog Sites

• Type III is the topic-focused type of blog. This is managed by one or a few individuals and concentrates on a small number of topics. Blog sites of this type, each with a topic specialty, exist for many topics.

• The success of our retrieval methods will depend on how well we are able to find this type of blog site for a given query.


Customizing the search - Types of Blog Sites

• To verify the validity of our categories, we manually classified 100 blog sites randomly selected from the pools for relevance judgments.

• In some cases we could not decide which category a blog site belongs to because it did not match any category. Most such blog sites were spam sites, e.g., sites that contain no real content but mostly advertisement links. We tagged such sites as “Unclassifiable”.


Customizing the search - Types of Blog Sites

• Three annotators independently labeled the blog sites. By majority voting, we assigned each blog site the label that at least two annotators agreed on. If all annotators gave different labels for a blog site, we tagged the site as “Unclassifiable”.

• As we expected, the majority of relevant blog sites were in the topic-focused category.
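The labeling rule described above can be sketched as a small function. The function name and the label strings are hypothetical; the rule itself (strict majority, otherwise “Unclassifiable”) follows the slide.

```python
from collections import Counter

def majority_label(labels, fallback="Unclassifiable"):
    """Return the label chosen by a strict majority of annotators, else the fallback."""
    label, count = Counter(labels).most_common(1)[0]
    # With three annotators, a strict majority means at least two votes.
    return label if count * 2 > len(labels) else fallback

agreed = majority_label(["Type III", "Type III", "Type I"])
split = majority_label(["Type I", "Type II", "Type III"])
```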


Customizing the search - Diversity Penalty

• We need to penalize Type I and Type II blog sites.

• To do this, we focus on the fact that they are not topic-centric. Accordingly, we considered a method for penalizing blog sites with diverse topics.

• We have to decide whether or not the blog site is topic-centric at the global level, i.e. the blog site level. Therefore, the penalty must be computable at the global level.


Customizing the search - Diversity Penalty by Global Representation

• The query likelihood score from the global representation could be used as a diversity penalty.

• We compute the score at the global level. Further, if the blog site deals with diverse topics, the distribution of words in the blog site is probably widely scattered, which lowers the probability assigned to any particular query term.


Customizing the search - Diversity Penalty by Global Representation

• We analyzed the distribution of the number of postings in the returned blog sites for each of the above-mentioned techniques.

• As the histogram in Figure 1 shows, the global representation returned far fewer blog sites with a large number of postings.

• In summary, the query likelihood score can be useful as a measure of diversity of blog sites. Furthermore, this score reflects the relevance of the blog site for the given topic.


Customizing the search - Diversity Penalty by Global Representation

• Accordingly, to supplement the other two resource selection techniques, we can use this score as a penalty factor for diversity by multiplying it by the previous ranking function as follows.
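The combined ranking function was shown as an image and is missing from the text. Writing φ_prev for the score from query generation maximization or pseudo-cluster selection and φ_GR for the global representation score (both symbol names are introduced here only for illustration), the multiplication described above would read:

```latex
\phi(Q, c_i) = \phi_{\mathrm{prev}}(Q, c_i) \times \phi_{\mathrm{GR}}(Q, c_i)
```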


Customizing the search - Clarity Score as a Penalty Factor

• We compute the clarity score by using the Kullback-Leibler divergence between a blog site and the whole collection as follows.

• We also use this score as a penalty factor for diversity by multiplying it by the previous ranking function as follows.
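The clarity computation can be sketched as the Kullback-Leibler divergence between a blog site's word distribution and that of the whole collection. This is a minimal illustration: the function name, the unsmoothed maximum-likelihood estimates, the base-2 logarithm, and the toy token lists are assumptions, since the slide's formula was not preserved.

```python
import math
from collections import Counter

def clarity_score(site_tokens, collection_tokens):
    """KL divergence between a site's unigram model and the collection model.

    Assumes the site's tokens are part of the collection, so every word in
    the site has a non-zero collection probability.
    """
    site_tf = Counter(site_tokens)
    coll_tf = Counter(collection_tokens)
    site_len, coll_len = len(site_tokens), len(collection_tokens)
    score = 0.0
    for word, count in site_tf.items():
        p_site = count / site_len
        p_coll = coll_tf[word] / coll_len
        score += p_site * math.log2(p_site / p_coll)
    return score

# Toy sites (invented): one topic-focused, one covering many topics.
focused = "camera lens camera review camera lens".split()
diverse = "camera weather politics recipe travel music".split()
collection = focused + diverse

focused_clarity = clarity_score(focused, collection)
diverse_clarity = clarity_score(diverse, collection)
```

In this toy collection the focused site diverges more from the collection model than the diverse site does, which matches the intuition that a topic-centric blog has a sharper word distribution.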


Customizing the search - Diversity Penalty by Random Sampling

• We randomly sample M postings from a blog site to obtain postings independent of any topic. We then compute the query likelihoods of the sampled postings for the given query.

• If the blog site is topic-centric and relevant to the topic, then the postings are likely to be relevant to the topic and the query likelihoods have high values.

• Therefore, the query likelihoods can be used for estimating diversity of a blog site.


Customizing the search - Diversity Penalty by Random Sampling

• We build a diversity penalty factor from the query likelihoods of the randomly sampled postings, in the same way as in pseudo-cluster selection.

• We compute a geometric mean of the query likelihoods.
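The sampling-based penalty can be sketched as follows. The function name, the fixed seed, and the toy likelihood values are assumptions for illustration; the likelihoods are assumed to be strictly positive and the site is assumed to have at least one posting.

```python
import math
import random

def sampling_penalty(posting_likelihoods, m, seed=0):
    """Geometric mean of the query likelihoods of m randomly sampled postings.

    posting_likelihoods: query likelihood of every posting in one blog site
    (hypothetical input; in practice only the sampled postings need scoring).
    """
    rng = random.Random(seed)
    sample = rng.sample(posting_likelihoods, min(m, len(posting_likelihoods)))
    # Geometric mean in log space to avoid underflow.
    return math.exp(sum(math.log(p) for p in sample) / len(sample))

# Toy sites (invented): every posting of the focused site matches the query
# reasonably well, while the diverse site has only one matching posting.
focused_site = [0.02, 0.03, 0.025, 0.02]
diverse_site = [0.02, 0.0001, 0.0002, 0.0001]

focused_penalty = sampling_penalty(focused_site, m=2)
diverse_penalty = sampling_penalty(diverse_site, m=2)
```

Because the postings are drawn without regard to the query, a site that only occasionally touches the topic is likely to have low-likelihood postings in its sample, and the geometric mean drops accordingly.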


Customizing the search - Experimental Results


Conclusion

• We defined the properties of blog sites and the goal of blog site search. Based on this goal, we introduced various resource selection algorithms for site search in blog collections.

• We classified the types of blog sites and claimed that an appropriate penalty factor reflecting the diversity of the topics of each blog site is required.

• Our experiments demonstrated that pseudo-cluster selection combined with a global representation penalty outperformed the other methods in all situations.