Overview of Information Retrieval and our Solutions

Qiang Yang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong


Page 1: Overview of Information Retrieval and our Solutions

Overview of Information Retrieval and our Solutions

Qiang Yang

Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

Hong Kong

Page 2: Overview of Information Retrieval and our Solutions

Why Do We Need Information Retrieval (IR)?

More and more information is available online (information overload)

Many tasks rely on effective management and exploitation of information

Textual information plays an important role in our lives

Effective text management directly improves productivity

Page 3: Overview of Information Retrieval and our Solutions

What is IR?

Narrow sense: IR = search-engine technologies (Google / Yahoo! / Live Search); IR = text matching/classification

Broad sense: IR = text information management:

How to find useful information? (information retrieval; e.g., Yahoo!)

How to organize information? (text classification; e.g., automatically assigning email to different folders)

How to discover knowledge from text? (text mining; e.g., discovering correlations between events)

Page 4: Overview of Information Retrieval and our Solutions

Difficulties

Huge amount of online data: Yahoo! had nearly 20 billion pages in its index (as of the beginning of 2005)

Different types of data: Web pages, emails, blogs, chat-room messages

Ambiguous queries: short (2-4 words) and ambiguous (e.g., "apple", "bank")

Page 5: Overview of Information Retrieval and our Solutions

Our Solutions

Query Classification: champion of KDDCUP'05; TOIS (Vol. 24); SIGIR'06; SIGKDD Explorations (Vol. 7)

Query Expansion/Suggestion: submissions to SIGIR'07, AAAI'07, KDD'07

Entity Resolution: submission to SIGIR'07

Web-page Classification/Clustering: SIGIR'04; CIKM'04; ICDM'04; ICDE'06; WWW'06; IPM (2007); DMKD (Vol. 12)

Document Summarization: SIGIR'05; IJCAI'07

Analysis of blogs, emails, and chat-room messages: SIGIR'06; ICDM'06 (2); IJCAI'07

Page 6: Overview of Information Retrieval and our Solutions

Outline

Query Classification (QC): introduction; Solution 1: query/category enrichment; Solution 2: bridging classifiers

Entity Resolution

Summary of Other Work

Page 7: Overview of Information Retrieval and our Solutions


Query Classification

Page 8: Overview of Information Retrieval and our Solutions

Introduction

Web queries are difficult to manage: short, ambiguous, and evolving

Query Classification (QC) helps to understand queries better, enabling vertical search, re-ranking of search results, and online advertisements

Difficulties of QC (different from text classification): how to represent queries; the target taxonomy is dynamic (e.g., an online-ads taxonomy); training data is difficult to collect

Page 9: Overview of Information Retrieval and our Solutions

Problem Definition

Inspired by the KDDCUP'05 competition

Classify a query into a ranked list of categories

Queries are collected from real search engines

Target categories are organized in a tree, with each node being a category

Page 10: Overview of Information Retrieval and our Solutions

Related Work

Document classification: feature selection [Yang et al. 1997]; feature generation [Cai et al. 2003]

Classification algorithms: Naive Bayes [McCallum and Nigam 1998], KNN [Yang 1999], SVM [Joachims 1999], ...

An overall survey in [Sebastiani 2002]

Page 11: Overview of Information Retrieval and our Solutions

Related Work: Query Classification/Clustering

Classifying Web queries by geographical locality [Gravano 2003]

Classifying queries according to their functional types [Kang 2003]

Beitzel et al. studied topical classification, as we do, but used manually classified data [Beitzel 2005]

Beeferman and Wen each worked on query clustering using clickthrough data [Beeferman 2000; Wen 2001]

Page 12: Overview of Information Retrieval and our Solutions

Related Work: Document/Query Expansion

Borrowing text from extra data sources: using hyperlinks [Glover 2002]; using implicit links from query logs [Shen 2006]; using existing taxonomies [Gabrilovich 2005]

Query expansion [Manning 2007]: global methods, independent of the query; local methods, using relevance feedback or pseudo-relevance feedback

Page 13: Overview of Information Retrieval and our Solutions

Solutions

[Figure: two diagrams connecting Queries to Target Categories]

Solution 1: Query/Category Enrichment

Solution 2: Bridging Classifier

Page 14: Overview of Information Retrieval and our Solutions

Solution 1: Query/Category Enrichment

Assumptions and architecture

Query enrichment

Classifiers: synonym-based classifiers; statistical classifiers

Experiments

Page 15: Overview of Information Retrieval and our Solutions

Assumptions & Architecture

The intended meanings of Web queries should be reflected by the Web

A set of objects exists that covers the target categories

[Figure: the architecture of our approach. Phase I (training): construction of the synonym-based classifiers and of the statistical classifier. Phase II (testing): a query is sent to a search engine; the labels and the text of the returned pages are classified by the two classifiers, and their results are combined into the final result.]

Page 16: Overview of Information Retrieval and our Solutions

Query Enrichment

[Figure: a query is enriched with the textual information (title, snippet, full text) and the category information of the pages returned by a search engine.]

Page 17: Overview of Information Retrieval and our Solutions

Synonym-Based Classifiers

[Figure: the pages returned for a query carry intermediate-taxonomy categories (C^I_1 ... C^I_4), which are mapped to target-taxonomy categories (C^T_1 ... C^T_3) to produce the final category C*.]

Page 18: Overview of Information Retrieval and our Solutions

Map by Word Matching

Direct matching: high precision, low recall

Extended matching via WordNet: e.g., "Hardware" → "Hardware; Device; Equipment"
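A minimal sketch of the two matching modes above (an illustration, not the authors' code): direct matching requires a shared word between the intermediate and target labels, while extended matching first expands each word with WordNet-style synonyms. The synset table here is a hypothetical stand-in for real WordNet lookups.

```python
# Hypothetical synonym sets standing in for WordNet.
SYNSETS = [{"hardware", "device", "equipment"}]

def expand(word):
    """Extended matching: return the word's synset, or the word itself."""
    for synset in SYNSETS:
        if word in synset:
            return synset
    return {word}

def map_category(label, targets, extended=False):
    """Map an intermediate-taxonomy label to matching target labels."""
    words = set(label.lower().split())
    if extended:
        words = set().union(*(expand(w) for w in words))
    return [t for t in targets if set(t.lower().split()) & words]

targets = ["Computers Hardware", "Living Tools"]
print(map_category("Device", targets))                  # direct: no match
print(map_category("Device", targets, extended=True))   # extended: matches
```

Direct matching misses "Device" entirely (high precision, low recall), while the synonym expansion recovers the mapping to "Computers Hardware", illustrating the precision/recall trade-off the slide describes.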

Page 19: Overview of Information Retrieval and our Solutions

Statistical Classifiers: SVM

Apply the synonym-based classifiers to map Web pages from the intermediate taxonomy to the target taxonomy

Use the resulting <page, target category> pairs as training data

Train SVM classifiers for the target categories

Page 20: Overview of Information Retrieval and our Solutions

Statistical Classifier: SVM

Advantages: [Figure: circles and triangles denote crawled pages; the black ones are mapped to the two categories successfully, while the white ones fail to map.] If a query happens to be represented by the white pages, it cannot be classified correctly by the synonym-based method, but it can be by SVM

Disadvantages: recall can be higher, but precision may suffer; once the target taxonomy changes, the classifiers must be retrained

Page 21: Overview of Information Retrieval and our Solutions

Putting Them Together: an Ensemble of Classifiers

Why an ensemble? The two kinds of classifiers are based on different mechanisms, so they can be complementary to each other, and a proper combination can improve performance

Combination strategies: EV (uses validation data); EN (no validation data)
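The exact EV/EN combination rules are not spelled out in the slides, so the following is only an illustrative stand-in for how two classifiers' category scores might be merged: normalize each classifier's scores, sum them per category, and return the categories ranked by the combined score.

```python
# Illustrative ensemble combination (not the authors' EV/EN strategies):
# sum the normalized scores of the two base classifiers per category.

def normalize(scores):
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores

def combine(scores_a, scores_b):
    a, b = normalize(scores_a), normalize(scores_b)
    merged = {c: a.get(c, 0.0) + b.get(c, 0.0) for c in set(a) | set(b)}
    return sorted(merged, key=merged.get, reverse=True)

# Toy scores from the two base classifiers.
synonym_scores = {"Computers Hardware": 2.0, "Living Tools": 1.0}
svm_scores = {"Computers Hardware": 0.4, "Computers Internet": 0.6}
print(combine(synonym_scores, svm_scores))
```

A category favored by both classifiers ("Computers Hardware") rises to the top even though neither classifier alone may rank everything correctly, which is the complementarity argument made above.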

Page 22: Overview of Information Retrieval and our Solutions

Experiment: Data Sets & Evaluation Criteria

Queries: from KDDCUP 2005; 800,000 queries, 800 of them labeled by three human labelers

Evaluation:

A_i: number of queries correctly tagged as category c_i
B_i: number of queries tagged as c_i
C_i: number of queries whose category is c_i

Precision = (sum_i A_i) / (sum_i B_i)
Recall = (sum_i A_i) / (sum_i C_i)
F1 = 2 * Precision * Recall / (Precision + Recall)

Overall F1 = (1/3) * sum_{i=1..3} F1 (measured against human labeler i)
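The evaluation above can be sketched as a few lines of Python: representing gold and predicted labels as (query, category) pairs, the sums of A_i, B_i, and C_i reduce to a set intersection and two list lengths.

```python
# Micro-averaged precision, recall, and F1 over (query, category) pairs,
# following the A_i / B_i / C_i definitions above.

def micro_prf(gold, predicted):
    """gold, predicted: lists of (query, category) pairs."""
    a = len(set(gold) & set(predicted))   # sum_i A_i: correctly tagged
    b = len(predicted)                    # sum_i B_i: tagged
    c = len(gold)                         # sum_i C_i: true labels
    p, r = a / b, a / c
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy data: queries may carry multiple labels, as in KDDCUP 2005.
gold = [("q1", "Sports"), ("q2", "Travel"), ("q3", "Sports")]
pred = [("q1", "Sports"), ("q2", "Sports"), ("q3", "Sports"), ("q3", "Travel")]
p, r, f1 = micro_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f1, 3))
```

The Overall F1 from the slide is then just the mean of this F1 computed against each of the three labelers' gold sets.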

Page 23: Overview of Information Retrieval and our Solutions

Experiment: Quality of the Data Sets

Consistency between labelers: the distribution of the labels assigned by the three labelers, and the performance of each labeler measured against the other labelers

Page 24: Overview of Information Retrieval and our Solutions

Experiment Results: Direct vs. Extended Matching

Number of pages collected for training using the different mapping methods

F1 of the synonym-based classifier and of SVM

Page 25: Overview of Information Retrieval and our Solutions

Experiment Results: the Number of Assigned Labels

[Figure: precision (0.20-0.70), recall (0.10-0.60), and F1 (0.20-0.45) vs. the number of guessed labels (1-6), for methods S1, S2, S3, SVM, EN, and EDP.]

Page 26: Overview of Information Retrieval and our Solutions

Experiment Results: Effect of Base Classifiers

Page 27: Overview of Information Retrieval and our Solutions

Solutions

[Figure: two diagrams connecting Queries to Target Categories]

Solution 1: Query/Category Enrichment

Solution 2: Bridging Classifier

Page 28: Overview of Information Retrieval and our Solutions

Solution 2: Bridging Classifiers

Our algorithm: the bridging classifier; category selection

Experiments: data set and evaluation criteria; results and analysis

Page 29: Overview of Information Retrieval and our Solutions

Algorithm: Bridging Classifier

Problem with Solution 1: the target taxonomy is fixed, and training must be repeated whenever it changes

Goal: connect the target taxonomy and the queries by taking an intermediate taxonomy as a bridge

Page 30: Overview of Information Retrieval and our Solutions

Algorithm: Bridging Classifier (Cont.)

How to connect? The relation between a target category C^T_i and a query q is estimated through the intermediate categories C^I_j:

p(C^T_i | q) ∝ sum_j p(C^T_i | C^I_j) * p(q | C^I_j) * p(C^I_j)

where p(C^I_j) is the prior probability of C^I_j, p(C^T_i | C^I_j) captures the relation between C^I_j and C^T_i, and p(q | C^I_j) captures the relation between C^I_j and q
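The bridging score can be sketched directly from the sum above. The probability tables below are invented toy values; in the real system they would be estimated from the intermediate taxonomy's documents.

```python
# Bridging score: sum over intermediate categories C^I_j of
# p(target | intermediate) * p(query | intermediate) * p(intermediate).

def bridging_score(target, query_given_inter, target_given_inter, prior):
    return sum(
        target_given_inter[j].get(target, 0.0) * query_given_inter[j] * prior[j]
        for j in prior
    )

prior = {"I1": 0.6, "I2": 0.4}              # p(C^I_j), toy values
query_given_inter = {"I1": 0.2, "I2": 0.7}  # p(q | C^I_j) for one query q
target_given_inter = {                      # p(C^T_i | C^I_j)
    "I1": {"Sports": 0.9, "Travel": 0.1},
    "I2": {"Sports": 0.2, "Travel": 0.8},
}

for t in ("Sports", "Travel"):
    print(t, round(bridging_score(t, query_given_inter, target_given_inter, prior), 3))
```

Nothing in this computation depends on training against the target taxonomy: if the target categories change, only the p(C^T_i | C^I_j) table needs to be recomputed, which is the point of the bridge.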

Page 31: Overview of Information Retrieval and our Solutions

Algorithm: Bridging Classifier (Cont.)

Understanding the bridging classifier:

Given q and C^T_i, the terms p(q | C^I_j) and p(C^T_i | C^I_j) are fixed

p(C^I_j), which reflects the size of C^I_j, acts as a weighting factor

The score tends to be larger when q and C^T_i tend to belong to the same smaller intermediate categories

Page 32: Overview of Information Retrieval and our Solutions

Algorithm: Category Selection

Category selection for reducing complexity:

Total Probability (TP)

Mutual Information (MI)
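As an illustration of the MI criterion (one common formulation; the paper's exact definition may differ), intermediate categories can be ranked by the mutual information between their occurrence and the target categories, computed from co-occurrence counts, and only the top-scoring ones kept.

```python
# Rank intermediate categories by mutual information with the target
# categories. The co-occurrence counts below are invented toy data.
import math

def mutual_information(counts):
    """counts[j][t] = co-occurrence count of intermediate j and target t."""
    total = sum(sum(row.values()) for row in counts.values())
    p_t = {}
    for row in counts.values():
        for t, n in row.items():
            p_t[t] = p_t.get(t, 0.0) + n / total
    mi = {}
    for j, row in counts.items():
        p_j = sum(row.values()) / total
        mi[j] = sum(
            (n / total) * math.log((n / total) / (p_j * p_t[t]))
            for t, n in row.items() if n
        )
    return mi

counts = {
    "I1": {"Sports": 40, "Travel": 0},   # concentrated: discriminative
    "I2": {"Sports": 20, "Travel": 20},  # spread out: less discriminative
}
mi = mutual_information(counts)
print(mi["I1"] > mi["I2"])
```

The concentrated category I1 scores higher, matching the intuition stated later that MI favors categories which are powerful for distinguishing the target categories.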

Page 33: Overview of Information Retrieval and our Solutions

Experiment: Data Sets and Evaluation Criteria

Intermediate taxonomy: ODP, with 1.5M Web pages in 172,565 categories

[Table: number of categories on different levels]

[Table: statistics of the numbers of documents in the categories on different levels]

Page 34: Overview of Information Retrieval and our Solutions

Experiment: Results of the Bridging Classifier

All intermediate categories are used; snippets only; best result when n = 60

Improvements of 10.4% and 7.1% in precision and F1, respectively, compared to the two previous approaches

Page 35: Overview of Information Retrieval and our Solutions

Experiment: Results of the Bridging Classifier

Best results when using all intermediate categories. Reason: a category with larger granularity may be a mixture of several target categories, so it cannot be used to distinguish between them

[Table: performance of the bridging classifier with different granularities of the intermediate taxonomy]

Page 36: Overview of Information Retrieval and our Solutions

Experiment: Effect of Category Selection

MI works better than TP: it favors the categories that are more powerful for distinguishing the target categories

With around 18,000 selected categories, the bridging classifier is comparable to, if not better than, the previous approaches

Page 37: Overview of Information Retrieval and our Solutions


Entity Resolution

Page 38: Overview of Information Retrieval and our Solutions

Definition: Reference & Entity

Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006

Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006

[Figure: in each citation, the author names are name references to author entities, and the venue strings are venue references to journal/conference entities.]

Page 39: Overview of Information Retrieval and our Solutions

Current Author Search

DBLP, CiteSeer, Google: all of them return a MIXED list of references

Page 40: Overview of Information Retrieval and our Solutions

Graphical Model

We convert entity resolution into a graph-partition problem: each node denotes a reference, and each edge denotes the relation between two references
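One simple way to realize the graph-partition view (an illustration only; the slides do not specify the authors' partitioning algorithm) is to connect reference pairs whose similarity exceeds a threshold and take the connected components as entities, via union-find.

```python
# Partition references into entities: link pairs above a similarity
# threshold, then return connected components (union-find).

def partition(references, similarity, threshold=0.4):
    n = len(references)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity(references[i], references[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(references[i])
    return list(groups.values())

# Toy similarity: Jaccard overlap of words in the reference strings.
def sim(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

refs = ["D. Nau planning", "Dana Nau planning", "J. Smith databases"]
print(partition(refs, sim))
```

In the real system the edge weights would come from the features F1-F5 introduced below rather than raw word overlap.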

Page 41: Overview of Information Retrieval and our Solutions

How to Measure the Relation Between References

Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006

Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005

[Figure: the two citations are related through their authors/coauthors, research community, research area, and plain-text similarity.]

Page 42: Overview of Information Retrieval and our Solutions

Features

F1: Title Similarity
F2: Coauthor Similarity
F3: Venue Similarity
F4: Research Community Overlap
F5: Research Area Overlap

Page 43: Overview of Information Retrieval and our Solutions

Research Community Overlap

A1, A2 stand for two author name references

F4.1: Similarity(A1, A2) = Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))

F4.2: Similarity(A1, A2) = Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2))

Coauthors(X) returns the coauthor name set of each author in set X

Venues(Y) returns the venue name set of each author in set Y
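Feature F4.1 above can be sketched with plain set operations: the overlap between the two-hop coauthor neighborhoods of the two author references. The coauthor graph below is invented for illustration.

```python
# Toy coauthor graph (invented): author -> set of coauthors.
COAUTHORS = {
    "A1": {"B", "C"},
    "A2": {"D"},
    "B": {"A1", "E"},
    "C": {"A1", "F"},
    "D": {"A2", "E"},
    "E": {"B", "D"},
    "F": {"C"},
}

def coauthors(authors):
    """Coauthors(X): union of the coauthor sets of each author in X."""
    return set().union(*(COAUTHORS.get(a, set()) for a in authors))

def f4_1(a1, a2):
    """F4.1: Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))."""
    return coauthors(coauthors({a1})) & coauthors(coauthors({a2}))

print(f4_1("A1", "A2"))
```

Here A1 and A2 share no direct coauthors, but their extended communities overlap in author E, so the feature still links them.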

Page 44: Overview of Information Retrieval and our Solutions

Research Area Overlap

V1, V2 stand for two venue references

F5.1: Similarity(V1, V2) = Authors(Articles(V1)) ∩ Authors(Articles(V2))

F5.2: Similarity(V1, V2) = Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2)))

Authors(X) returns the author name set of each article in set X

Articles(Y) returns the set of articles holding a reference to each element in set Y

Page 45: Overview of Information Retrieval and our Solutions

System Framework

[Figure: the system pipeline computes pairwise similarities from the features, converts them to probabilities, and partitions the reference graph.]

Page 46: Overview of Information Retrieval and our Solutions

Experiment Results

Our dataset: 1,000 references to 20 author entities from DBLP

Getoor's datasets: CiteSeer, 2,892 author references to 1,165 author entities; arXiv, 58,515 references to 9,200 author entities

F1 = 97.0%

Page 47: Overview of Information Retrieval and our Solutions


Summary of Other Work

Page 48: Overview of Information Retrieval and our Solutions

Summary of Other Work

Summarization using Conditional Random Fields (IJCAI '07)

Thread Detection in Dynamic Text Message Streams (SIGIR '06)

Implicit Links for Web Page Classification (WWW '06)

Text Classification Improved by Multigram Models (CIKM '06)

Latent Friend Mining from Blog Data (ICDM '06)

Web-page Classification through Summarization (SIGIR '04)

Page 49: Overview of Information Retrieval and our Solutions

Summarization using Conditional Random Fields (IJCAI '07)

Motivation / observation: summarization can be cast as sequence labeling

Solution: a linear-chain CRF over the sentence sequence, with the sentences x_t observed and the labels y_t unobserved; feature functions and parameters are learned from data

[Figure: the three-step labeling process and the linear-chain CRF graph over (x_t, y_t).]

Page 50: Overview of Information Retrieval and our Solutions

Thread Detection in Dynamic Text Message Streams (SIGIR '06)

Representation: content-based and structure-based (sentence type; personal pronouns)

Clustering

Page 51: Overview of Information Retrieval and our Solutions

Implicit Links for Web Page Classification (WWW '06)

Implicit link 1 (LI1). Assumption: a user tends to click the pages related to the issued query. Definition: there is an LI1 between d1 and d2 if they are clicked by the same person through the same query

Implicit link 2 (LI2). Assumption: users tend to click related pages according to the same query. Definition: there is an LI2 between d1 and d2 if they are clicked according to the same query
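The two link definitions above can be sketched as one extraction pass over a clickthrough log of (user, query, clicked page) records; LI1 groups clicks by (user, query), LI2 by query alone. The log below is toy data.

```python
# Extract implicit links from a clickthrough log (toy data, invented).
from itertools import combinations

log = [
    ("u1", "jaguar", "d1"), ("u1", "jaguar", "d2"),
    ("u2", "jaguar", "d3"), ("u2", "car", "d3"),
]

def implicit_links(log, use_user=True):
    """LI1 if use_user (same person + same query), else LI2 (same query)."""
    groups = {}
    for user, query, page in log:
        key = (user, query) if use_user else query
        groups.setdefault(key, set()).add(page)
    edges = set()
    for pages in groups.values():
        edges |= {tuple(sorted(p)) for p in combinations(pages, 2)}
    return edges

print(implicit_links(log))                  # LI1 edges
print(implicit_links(log, use_user=False))  # LI2 edges
```

On this log, LI1 yields only the (d1, d2) edge, while the looser LI2 definition also links d3 to both, which matches the strictness ordering of the two assumptions.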

Page 52: Overview of Information Retrieval and our Solutions

Text Classification Improved by Multigram Models (CIKM '06)

Training stage: for each category, train an n-multigram model, then train an n-gram model on the resulting segment sequences

Test stage: for a test document, segment it for each category, calculate its probability under the corresponding n-gram model, and assign the category under which the document has the largest probability
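The test stage above can be illustrated with a much-simplified sketch: score a document under a per-category language model and pick the argmax. Unigram models with Laplace smoothing stand in here for the paper's n-multigram/n-gram models; the training texts are invented.

```python
# Simplified per-category language-model classification (illustrative
# stand-in for the paper's multigram models).
import math

def train(docs):
    counts, total = {}, 0
    for doc in docs:
        for w in doc.split():
            counts[w] = counts.get(w, 0) + 1
            total += 1
    return counts, total

def log_prob(doc, model, vocab_size):
    counts, total = model
    # Laplace-smoothed unigram log-probability of the document.
    return sum(
        math.log((counts.get(w, 0) + 1) / (total + vocab_size))
        for w in doc.split()
    )

models = {
    "sports": train(["goal match team", "team wins match"]),
    "finance": train(["stock market falls", "market rally"]),
}
vocab = {w for counts, _ in models.values() for w in counts}

def classify(doc):
    return max(models, key=lambda c: log_prob(doc, models[c], len(vocab)))

print(classify("match result for the team"))
```

The multigram models in the paper additionally segment the document into variable-length units before scoring, which this unigram sketch omits.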

Page 53: Overview of Information Retrieval and our Solutions

Latent Friend Mining from Blog Data (ICDM '06)

Objective: one way to build Web communities is to find the people sharing similar interests with a target person

"Interest" is reflected by their writings, and the writings come from their blogs

These people may not know each other, so they are not linked as in previous studies

Page 54: Overview of Information Retrieval and our Solutions

Latent Friend Mining from Blog Data (Cont.)

Solutions:

Cosine similarity-based method: calculate the cosine similarity between the contents of the blogs

Topic model-based method: find latent topics in the blogs using latent topic models and calculate the similarity at the topic level

Two-level similarity-based method: first use an existing topic hierarchy to get the topic distribution of a blogger's blogs, then apply a detailed similarity comparison
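The first method above is a few lines of code: cosine similarity between term-count vectors of two bloggers' posts. The blog snippets here are invented toy text.

```python
# Cosine similarity between term-count vectors of two texts.
import math
from collections import Counter

def cosine(text_a, text_b):
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

blog_1 = "hiking trails mountain hiking gear"
blog_2 = "mountain hiking boots and gear"
blog_3 = "stock market analysis daily"

print(round(cosine(blog_1, blog_2), 3))
print(cosine(blog_1, blog_3))
```

Raw term overlap like this is exactly what the topic-level methods improve on: two bloggers writing about the same topic in different words get a cosine of zero here but can still match at the topic level.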

Page 55: Overview of Information Retrieval and our Solutions

Web-page Classification through Summarization (SIGIR '04)

[Figure: training and testing pages are fed through a combined summarizer (LUHN, LSA, supervised, page-layout analysis, description); the resulting training and testing summaries are passed to the classifier to produce the result.]

Page 56: Overview of Information Retrieval and our Solutions


Thanks