Overview of Information Retrieval and our Solutions

Qiang Yang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong


Page 1: Overview of Information Retrieval and our Solutions

Overview of Information Retrieval and our Solutions

Qiang Yang

Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

Hong Kong

Page 2: Overview of Information Retrieval and our Solutions

Why Do We Need Information Retrieval (IR)?

More and more information is available online (information overload)

Many tasks rely on effective management and exploitation of information

Textual information plays an important role in our lives

Effective text management directly improves productivity

Page 3: Overview of Information Retrieval and our Solutions

What is IR?

Narrow sense: IR = search-engine technologies (Google / Yahoo! / Live Search); IR = text matching/classification

Broad sense: IR = text information management:

How to find useful information? (information retrieval; e.g., Yahoo!)

How to organize information? (text classification; e.g., automatically assigning email to different folders)

How to discover knowledge from text? (text mining; e.g., discovering correlations between events)

Page 4: Overview of Information Retrieval and our Solutions

Difficulties

Huge amount of online data: Yahoo! had nearly 20 billion pages in its index (as of the beginning of 2005)

Different types of data: Web pages, emails, blogs, chat-room messages

Ambiguous queries: short (2-4 words) and ambiguous (e.g., "apple", "bank")

Page 5: Overview of Information Retrieval and our Solutions

Our Solutions

Query Classification: champion of KDDCUP'05; TOIS (Vol. 24); SIGIR'06; SIGKDD Explorations (Vol. 7)

Query Expansion/Suggestion: submissions to SIGIR'07, AAAI'07, KDD'07

Entity Resolution: submission to SIGIR'07

Web-page Classification/Clustering: SIGIR'04; CIKM'04; ICDM'04; ICDE'06; WWW'06; IPM (2007); DMKD (Vol. 12)

Document Summarization: SIGIR'05; IJCAI'07

Analysis of blogs, emails, and chat-room messages: SIGIR'06; ICDM'06 (2); IJCAI'07

Page 6: Overview of Information Retrieval and our Solutions

Outline

Query Classification (QC): introduction; Solution 1: query/category enrichment; Solution 2: bridging classifiers

Entity Resolution

Summary of Other Work

Page 7: Overview of Information Retrieval and our Solutions


Query Classification

Page 8: Overview of Information Retrieval and our Solutions

Introduction

Web queries are difficult to manage: short, ambiguous, and evolving

Query Classification (QC) helps to understand queries better, enabling vertical search, re-ranking of search results, and online advertisements

Difficulties of QC (different from text classification): how to represent queries; the target taxonomy is dynamic (e.g., an online-ads taxonomy); training data is difficult to collect

Page 9: Overview of Information Retrieval and our Solutions

Problem Definition

Inspired by the KDDCUP'05 competition

Classify a query into a ranked list of categories

Queries are collected from real search engines

Target categories are organized in a tree, with each node being a category

Page 10: Overview of Information Retrieval and our Solutions

Related Work

Document classification: feature selection [Yang et al. 1997]; feature generation [Cai et al. 2003]

Classification algorithms: Naive Bayes [McCallum and Nigam 1998], KNN [Yang 1999], SVM [Joachims 1999], ...

An overall survey in [Sebastiani 2002]

Page 11: Overview of Information Retrieval and our Solutions

Related Work: Query Classification/Clustering

Classifying Web queries by geographical locality [Gravano 2003]

Classifying queries according to their functional types [Kang 2003]

Beitzel et al. studied topical classification, as we do, but used manually classified data [Beitzel 2005]

Beeferman and Wen each worked on query clustering using clickthrough data [Beeferman 2000; Wen 2001]

Page 12: Overview of Information Retrieval and our Solutions

Related Work: Document/Query Expansion

Borrowing text from extra data sources: using hyperlinks [Glover 2002]; using implicit links from query logs [Shen 2006]; using existing taxonomies [Gabrilovich 2005]

Query expansion [Manning 2007]: global methods, independent of the query; local methods, using relevance feedback or pseudo-relevance feedback

Page 13: Overview of Information Retrieval and our Solutions

Solutions

[Figure: two diagrams connecting Queries to Target Categories]

Solution 1: Query/Category Enrichment

Solution 2: Bridging Classifier

Page 14: Overview of Information Retrieval and our Solutions

Solution 1: Query/Category Enrichment

Assumptions and architecture

Query enrichment

Classifiers: synonym-based classifiers; statistical classifiers

Experiments

Page 15: Overview of Information Retrieval and our Solutions

Assumptions & Architecture

The intended meanings of Web queries should be reflected by the Web

A set of objects exists that covers the target categories

[Figure: the architecture of our approach. Phase I (training): construction of the synonym-based classifiers and of the statistical classifier. Phase II (testing): a query is sent to a search engine; the labels and the text of the returned pages are classified by the two classifiers, and their results are combined into the final result.]

Page 16: Overview of Information Retrieval and our Solutions

Query Enrichment

[Figure: a query is enriched with the textual information (title, snippet, full text) and the category information of the pages returned by a search engine.]

Page 17: Overview of Information Retrieval and our Solutions

Synonym-Based Classifiers

[Figure: the pages returned for a query carry intermediate-taxonomy categories (C^I_1 ... C^I_4), which are mapped to target-taxonomy categories (C^T_1 ... C^T_3) to produce the final category C*.]

Page 18: Overview of Information Retrieval and our Solutions

Map by Word Matching

Direct matching: high precision, low recall

Extended matching via WordNet: e.g., "Hardware" → "Hardware; Device; Equipment"
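A minimal sketch of the two matching modes above (an illustration, not the authors' code): direct matching requires a shared word between the intermediate and target labels, while extended matching first expands each word with WordNet-style synonyms. The synset table here is a hypothetical stand-in for real WordNet lookups.

```python
# Hypothetical synonym sets standing in for WordNet.
SYNSETS = [{"hardware", "device", "equipment"}]

def expand(word):
    """Extended matching: return the word's synset, or the word itself."""
    for synset in SYNSETS:
        if word in synset:
            return synset
    return {word}

def map_category(label, targets, extended=False):
    """Map an intermediate-taxonomy label to matching target labels."""
    words = set(label.lower().split())
    if extended:
        words = set().union(*(expand(w) for w in words))
    return [t for t in targets if set(t.lower().split()) & words]

targets = ["Computers Hardware", "Living Tools"]
print(map_category("Device", targets))                  # direct: no match
print(map_category("Device", targets, extended=True))   # extended: matches
```

Direct matching misses "Device" entirely (high precision, low recall), while the synonym expansion recovers the mapping to "Computers Hardware", illustrating the precision/recall trade-off the slide describes.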

Page 19: Overview of Information Retrieval and our Solutions

Statistical Classifiers: SVM

Apply the synonym-based classifiers to map Web pages from the intermediate taxonomy to the target taxonomy

Use the resulting <page, target category> pairs as training data

Train SVM classifiers for the target categories

Page 20: Overview of Information Retrieval and our Solutions

Statistical Classifier: SVM

Advantages: [Figure: circles and triangles denote crawled pages; the black ones are mapped to the two categories successfully, while the white ones fail to map.] If a query happens to be represented by the white pages, it cannot be classified correctly by the synonym-based method, but it can be by SVM

Disadvantages: recall can be higher, but precision may suffer; once the target taxonomy changes, the classifiers must be retrained

Page 21: Overview of Information Retrieval and our Solutions

Putting Them Together: an Ensemble of Classifiers

Why an ensemble? The two kinds of classifiers are based on different mechanisms, so they can be complementary to each other, and a proper combination can improve performance

Combination strategies: EV (uses validation data); EN (no validation data)
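The exact EV/EN combination rules are not spelled out in the slides, so the following is only an illustrative stand-in for how two classifiers' category scores might be merged: normalize each classifier's scores, sum them per category, and return the categories ranked by the combined score.

```python
# Illustrative ensemble combination (not the authors' EV/EN strategies):
# sum the normalized scores of the two base classifiers per category.

def normalize(scores):
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores

def combine(scores_a, scores_b):
    a, b = normalize(scores_a), normalize(scores_b)
    merged = {c: a.get(c, 0.0) + b.get(c, 0.0) for c in set(a) | set(b)}
    return sorted(merged, key=merged.get, reverse=True)

# Toy scores from the two base classifiers.
synonym_scores = {"Computers Hardware": 2.0, "Living Tools": 1.0}
svm_scores = {"Computers Hardware": 0.4, "Computers Internet": 0.6}
print(combine(synonym_scores, svm_scores))
```

A category favored by both classifiers ("Computers Hardware") rises to the top even though neither classifier alone may rank everything correctly, which is the complementarity argument made above.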

Page 22: Overview of Information Retrieval and our Solutions

Experiment: Data Sets & Evaluation Criteria

Queries: from KDDCUP 2005; 800,000 queries, 800 of them labeled by three human labelers

Evaluation:

A_i: number of queries correctly tagged as category c_i
B_i: number of queries tagged as c_i
C_i: number of queries whose category is c_i

Precision = (sum_i A_i) / (sum_i B_i)
Recall = (sum_i A_i) / (sum_i C_i)
F1 = 2 * Precision * Recall / (Precision + Recall)

Overall F1 = (1/3) * sum_{i=1..3} F1 (measured against human labeler i)
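The evaluation above can be sketched as a few lines of Python: representing gold and predicted labels as (query, category) pairs, the sums of A_i, B_i, and C_i reduce to a set intersection and two list lengths.

```python
# Micro-averaged precision, recall, and F1 over (query, category) pairs,
# following the A_i / B_i / C_i definitions above.

def micro_prf(gold, predicted):
    """gold, predicted: lists of (query, category) pairs."""
    a = len(set(gold) & set(predicted))   # sum_i A_i: correctly tagged
    b = len(predicted)                    # sum_i B_i: tagged
    c = len(gold)                         # sum_i C_i: true labels
    p, r = a / b, a / c
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy data: queries may carry multiple labels, as in KDDCUP 2005.
gold = [("q1", "Sports"), ("q2", "Travel"), ("q3", "Sports")]
pred = [("q1", "Sports"), ("q2", "Sports"), ("q3", "Sports"), ("q3", "Travel")]
p, r, f1 = micro_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f1, 3))
```

The Overall F1 from the slide is then just the mean of this F1 computed against each of the three labelers' gold sets.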

Page 23: Overview of Information Retrieval and our Solutions

Experiment: Quality of the Data Sets

Consistency between labelers: the distribution of the labels assigned by the three labelers, and the performance of each labeler measured against the other labelers

Page 24: Overview of Information Retrieval and our Solutions

Experiment Results: Direct vs. Extended Matching

Number of pages collected for training using the different mapping methods

F1 of the synonym-based classifier and of SVM

Page 25: Overview of Information Retrieval and our Solutions

Experiment Results: the Number of Assigned Labels

[Figure: precision (0.20-0.70), recall (0.10-0.60), and F1 (0.20-0.45) vs. the number of guessed labels (1-6), for methods S1, S2, S3, SVM, EN, and EDP.]

Page 26: Overview of Information Retrieval and our Solutions

Experiment Results: Effect of Base Classifiers

Page 27: Overview of Information Retrieval and our Solutions

Solutions

[Figure: two diagrams connecting Queries to Target Categories]

Solution 1: Query/Category Enrichment

Solution 2: Bridging Classifier

Page 28: Overview of Information Retrieval and our Solutions

Solution 2: Bridging Classifiers

Our algorithm: the bridging classifier; category selection

Experiments: data set and evaluation criteria; results and analysis

Page 29: Overview of Information Retrieval and our Solutions

Algorithm: Bridging Classifier

Problem with Solution 1: the target taxonomy is fixed, and training must be repeated whenever it changes

Goal: connect the target taxonomy and the queries by taking an intermediate taxonomy as a bridge

Page 30: Overview of Information Retrieval and our Solutions

Algorithm: Bridging Classifier (Cont.)

How to connect? The relation between a target category C^T_i and a query q is estimated through the intermediate categories C^I_j:

p(C^T_i | q) ∝ sum_j p(C^T_i | C^I_j) * p(q | C^I_j) * p(C^I_j)

where p(C^I_j) is the prior probability of C^I_j, p(C^T_i | C^I_j) captures the relation between C^I_j and C^T_i, and p(q | C^I_j) captures the relation between C^I_j and q
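The bridging score can be sketched directly from the sum above. The probability tables below are invented toy values; in the real system they would be estimated from the intermediate taxonomy's documents.

```python
# Bridging score: sum over intermediate categories C^I_j of
# p(target | intermediate) * p(query | intermediate) * p(intermediate).

def bridging_score(target, query_given_inter, target_given_inter, prior):
    return sum(
        target_given_inter[j].get(target, 0.0) * query_given_inter[j] * prior[j]
        for j in prior
    )

prior = {"I1": 0.6, "I2": 0.4}              # p(C^I_j), toy values
query_given_inter = {"I1": 0.2, "I2": 0.7}  # p(q | C^I_j) for one query q
target_given_inter = {                      # p(C^T_i | C^I_j)
    "I1": {"Sports": 0.9, "Travel": 0.1},
    "I2": {"Sports": 0.2, "Travel": 0.8},
}

for t in ("Sports", "Travel"):
    print(t, round(bridging_score(t, query_given_inter, target_given_inter, prior), 3))
```

Nothing in this computation depends on training against the target taxonomy: if the target categories change, only the p(C^T_i | C^I_j) table needs to be recomputed, which is the point of the bridge.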

Page 31: Overview of Information Retrieval and our Solutions

Algorithm: Bridging Classifier (Cont.)

Understanding the bridging classifier:

Given q and C^T_i, the terms p(q | C^I_j) and p(C^T_i | C^I_j) are fixed

p(C^I_j), which reflects the size of C^I_j, acts as a weighting factor

The score tends to be larger when q and C^T_i tend to belong to the same smaller intermediate categories

Page 32: Overview of Information Retrieval and our Solutions

Algorithm: Category Selection

Category selection for reducing complexity:

Total Probability (TP)

Mutual Information (MI)
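As an illustration of the MI criterion (one common formulation; the paper's exact definition may differ), intermediate categories can be ranked by the mutual information between their occurrence and the target categories, computed from co-occurrence counts, and only the top-scoring ones kept.

```python
# Rank intermediate categories by mutual information with the target
# categories. The co-occurrence counts below are invented toy data.
import math

def mutual_information(counts):
    """counts[j][t] = co-occurrence count of intermediate j and target t."""
    total = sum(sum(row.values()) for row in counts.values())
    p_t = {}
    for row in counts.values():
        for t, n in row.items():
            p_t[t] = p_t.get(t, 0.0) + n / total
    mi = {}
    for j, row in counts.items():
        p_j = sum(row.values()) / total
        mi[j] = sum(
            (n / total) * math.log((n / total) / (p_j * p_t[t]))
            for t, n in row.items() if n
        )
    return mi

counts = {
    "I1": {"Sports": 40, "Travel": 0},   # concentrated: discriminative
    "I2": {"Sports": 20, "Travel": 20},  # spread out: less discriminative
}
mi = mutual_information(counts)
print(mi["I1"] > mi["I2"])
```

The concentrated category I1 scores higher, matching the intuition stated later that MI favors categories which are powerful for distinguishing the target categories.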

Page 33: Overview of Information Retrieval and our Solutions

Experiment: Data Sets and Evaluation Criteria

Intermediate taxonomy: ODP, with 1.5M Web pages in 172,565 categories

[Table: number of categories on different levels]

[Table: statistics of the numbers of documents in the categories on different levels]

Page 34: Overview of Information Retrieval and our Solutions

Experiment: Results of the Bridging Classifier

All intermediate categories are used; snippets only; best result when n = 60

Improvements of 10.4% and 7.1% in precision and F1, respectively, compared to the two previous approaches

Page 35: Overview of Information Retrieval and our Solutions

Experiment: Results of the Bridging Classifier

Best results when using all intermediate categories. Reason: a category with larger granularity may be a mixture of several target categories, so it cannot be used to distinguish between them

[Table: performance of the bridging classifier with different granularities of the intermediate taxonomy]

Page 36: Overview of Information Retrieval and our Solutions

Experiment: Effect of Category Selection

MI works better than TP: it favors the categories that are more powerful for distinguishing the target categories

With around 18,000 selected categories, the bridging classifier is comparable to, if not better than, the previous approaches

Page 37: Overview of Information Retrieval and our Solutions


Entity Resolution

Page 38: Overview of Information Retrieval and our Solutions

Definition: Reference & Entity

Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006

Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006

[Figure: in each citation, the author names are name references to author entities, and the venue strings are venue references to journal/conference entities.]

Page 39: Overview of Information Retrieval and our Solutions

Current Author Search

DBLP, CiteSeer, Google: all of them return a MIXED list of references

Page 40: Overview of Information Retrieval and our Solutions

Graphical Model

We convert entity resolution into a graph-partition problem: each node denotes a reference, and each edge denotes the relation between two references
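One simple way to realize the graph-partition view (an illustration only; the slides do not specify the authors' partitioning algorithm) is to connect reference pairs whose similarity exceeds a threshold and take the connected components as entities, via union-find.

```python
# Partition references into entities: link pairs above a similarity
# threshold, then return connected components (union-find).

def partition(references, similarity, threshold=0.4):
    n = len(references)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity(references[i], references[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(references[i])
    return list(groups.values())

# Toy similarity: Jaccard overlap of words in the reference strings.
def sim(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

refs = ["D. Nau planning", "Dana Nau planning", "J. Smith databases"]
print(partition(refs, sim))
```

In the real system the edge weights would come from the features F1-F5 introduced below rather than raw word overlap.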

Page 41: Overview of Information Retrieval and our Solutions

How to Measure the Relation Between References

Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006

Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005

[Figure: the two citations are related through their authors/coauthors, research community, research area, and plain-text similarity.]

Page 42: Overview of Information Retrieval and our Solutions

Features

F1: Title Similarity
F2: Coauthor Similarity
F3: Venue Similarity
F4: Research Community Overlap
F5: Research Area Overlap

Page 43: Overview of Information Retrieval and our Solutions

Research Community Overlap

A1, A2 stand for two author name references

F4.1: Similarity(A1, A2) = Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))

F4.2: Similarity(A1, A2) = Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2))

Coauthors(X) returns the coauthor name set of each author in set X

Venues(Y) returns the venue name set of each author in set Y
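Feature F4.1 above can be sketched with plain set operations: the overlap between the two-hop coauthor neighborhoods of the two author references. The coauthor graph below is invented for illustration.

```python
# Toy coauthor graph (invented): author -> set of coauthors.
COAUTHORS = {
    "A1": {"B", "C"},
    "A2": {"D"},
    "B": {"A1", "E"},
    "C": {"A1", "F"},
    "D": {"A2", "E"},
    "E": {"B", "D"},
    "F": {"C"},
}

def coauthors(authors):
    """Coauthors(X): union of the coauthor sets of each author in X."""
    return set().union(*(COAUTHORS.get(a, set()) for a in authors))

def f4_1(a1, a2):
    """F4.1: Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))."""
    return coauthors(coauthors({a1})) & coauthors(coauthors({a2}))

print(f4_1("A1", "A2"))
```

Here A1 and A2 share no direct coauthors, but their extended communities overlap in author E, so the feature still links them.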

Page 44: Overview of Information Retrieval and our Solutions

Research Area Overlap

V1, V2 stand for two venue references

F5.1: Similarity(V1, V2) = Authors(Articles(V1)) ∩ Authors(Articles(V2))

F5.2: Similarity(V1, V2) = Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2)))

Authors(X) returns the author name set of each article in set X

Articles(Y) returns the set of articles holding a reference to each element in set Y

Page 45: Overview of Information Retrieval and our Solutions

System Framework

[Figure: the system pipeline computes pairwise similarities from the features, converts them to probabilities, and partitions the reference graph.]

Page 46: Overview of Information Retrieval and our Solutions

Experiment Results

Our dataset: 1,000 references to 20 author entities from DBLP

Getoor's datasets: CiteSeer, 2,892 author references to 1,165 author entities; arXiv, 58,515 references to 9,200 author entities

F1 = 97.0%

Page 47: Overview of Information Retrieval and our Solutions


Summary of Other Work

Page 48: Overview of Information Retrieval and our Solutions

Summary of Other Work

Summarization using Conditional Random Fields (IJCAI '07)

Thread Detection in Dynamic Text Message Streams (SIGIR '06)

Implicit Links for Web Page Classification (WWW '06)

Text Classification Improved by Multigram Models (CIKM '06)

Latent Friend Mining from Blog Data (ICDM '06)

Web-page Classification through Summarization (SIGIR '04)

Page 49: Overview of Information Retrieval and our Solutions

Summarization using Conditional Random Fields (IJCAI '07)

Motivation / observation: summarization can be cast as sequence labeling

Solution: a linear-chain CRF over the sentence sequence, with the sentences x_t observed and the labels y_t unobserved; feature functions and parameters are learned from data

[Figure: the three-step labeling process and the linear-chain CRF graph over (x_t, y_t).]

Page 50: Overview of Information Retrieval and our Solutions

Thread Detection in Dynamic Text Message Streams (SIGIR '06)

Representation: content-based and structure-based (sentence type; personal pronouns)

Clustering

Page 51: Overview of Information Retrieval and our Solutions

Implicit Links for Web Page Classification (WWW '06)

Implicit link 1 (LI1). Assumption: a user tends to click the pages related to the issued query. Definition: there is an LI1 between d1 and d2 if they are clicked by the same person through the same query

Implicit link 2 (LI2). Assumption: users tend to click related pages according to the same query. Definition: there is an LI2 between d1 and d2 if they are clicked according to the same query
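The two link definitions above can be sketched as one extraction pass over a clickthrough log of (user, query, clicked page) records; LI1 groups clicks by (user, query), LI2 by query alone. The log below is toy data.

```python
# Extract implicit links from a clickthrough log (toy data, invented).
from itertools import combinations

log = [
    ("u1", "jaguar", "d1"), ("u1", "jaguar", "d2"),
    ("u2", "jaguar", "d3"), ("u2", "car", "d3"),
]

def implicit_links(log, use_user=True):
    """LI1 if use_user (same person + same query), else LI2 (same query)."""
    groups = {}
    for user, query, page in log:
        key = (user, query) if use_user else query
        groups.setdefault(key, set()).add(page)
    edges = set()
    for pages in groups.values():
        edges |= {tuple(sorted(p)) for p in combinations(pages, 2)}
    return edges

print(implicit_links(log))                  # LI1 edges
print(implicit_links(log, use_user=False))  # LI2 edges
```

On this log, LI1 yields only the (d1, d2) edge, while the looser LI2 definition also links d3 to both, which matches the strictness ordering of the two assumptions.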

Page 52: Overview of Information Retrieval and our Solutions

Text Classification Improved by Multigram Models (CIKM '06)

Training stage: for each category, train an n-multigram model, then train an n-gram model on the resulting segment sequences

Test stage: for a test document, segment it for each category, calculate its probability under the corresponding n-gram model, and assign the category under which the document has the largest probability
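The test stage above can be illustrated with a much-simplified sketch: score a document under a per-category language model and pick the argmax. Unigram models with Laplace smoothing stand in here for the paper's n-multigram/n-gram models; the training texts are invented.

```python
# Simplified per-category language-model classification (illustrative
# stand-in for the paper's multigram models).
import math

def train(docs):
    counts, total = {}, 0
    for doc in docs:
        for w in doc.split():
            counts[w] = counts.get(w, 0) + 1
            total += 1
    return counts, total

def log_prob(doc, model, vocab_size):
    counts, total = model
    # Laplace-smoothed unigram log-probability of the document.
    return sum(
        math.log((counts.get(w, 0) + 1) / (total + vocab_size))
        for w in doc.split()
    )

models = {
    "sports": train(["goal match team", "team wins match"]),
    "finance": train(["stock market falls", "market rally"]),
}
vocab = {w for counts, _ in models.values() for w in counts}

def classify(doc):
    return max(models, key=lambda c: log_prob(doc, models[c], len(vocab)))

print(classify("match result for the team"))
```

The multigram models in the paper additionally segment the document into variable-length units before scoring, which this unigram sketch omits.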

Page 53: Overview of Information Retrieval and our Solutions

Latent Friend Mining from Blog Data (ICDM '06)

Objective: one way to build Web communities is to find the people sharing similar interests with a target person

"Interest" is reflected by their writings, and the writings come from their blogs

These people may not know each other, so they are not linked as in previous studies

Page 54: Overview of Information Retrieval and our Solutions

Latent Friend Mining from Blog Data (Cont.)

Solutions:

Cosine similarity-based method: calculate the cosine similarity between the contents of the blogs

Topic model-based method: find latent topics in the blogs using latent topic models and calculate the similarity at the topic level

Two-level similarity-based method: first use an existing topic hierarchy to get the topic distribution of a blogger's blogs, then apply a detailed similarity comparison
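The first method above is a few lines of code: cosine similarity between term-count vectors of two bloggers' posts. The blog snippets here are invented toy text.

```python
# Cosine similarity between term-count vectors of two texts.
import math
from collections import Counter

def cosine(text_a, text_b):
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

blog_1 = "hiking trails mountain hiking gear"
blog_2 = "mountain hiking boots and gear"
blog_3 = "stock market analysis daily"

print(round(cosine(blog_1, blog_2), 3))
print(cosine(blog_1, blog_3))
```

Raw term overlap like this is exactly what the topic-level methods improve on: two bloggers writing about the same topic in different words get a cosine of zero here but can still match at the topic level.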

Page 55: Overview of Information Retrieval and our Solutions

Web-page Classification through Summarization (SIGIR '04)

[Figure: training and testing pages are fed through a combined summarizer (LUHN, LSA, supervised, page-layout analysis, description); the resulting training and testing summaries are passed to the classifier to produce the result.]

Page 56: Overview of Information Retrieval and our Solutions


Thanks