
Type-enabled Keyword Searches with Uncertain Schema

Soumen Chakrabarti

IIT Bombay

www.cse.iitb.ac.in/~soumen


Evolution of Web search

The first decade of Web search:
• Crawling and indexing at massive scale
• Macroscopic whole-page connectivity analysis
• Very limited expression of information need

A clear trend: exploiting entities and relations
• Maintaining large type systems and ontologies
• Discovering mentions of entities and relations
• Deduplicating and canonicalizing mentions
• Forming uncertain, probabilistic E-R graphs
• Enhancing keyword or schema-aware queries


[Figure: system architecture. A raw corpus passes through named entity tagging, relation tagging, and disambiguation to yield an annotated corpus; an indexer builds a text index and an annotation index over it. A uniform lexical network provider unifies WordNet, Wikipedia, FrameNet, and KnowItAll. At query time, an answer type predictor and a keyword match predictor, informed by past query workload stats, feed a ranking engine that returns response snippets. Numbered markers 1 to 4 flag the stages covered in the talk.]


Populating entity and relation tables

• Hearst patterns (Hearst 1992): "T such as x", "x and other T", "x is a T"
• DIPRE (Brin 1998) and Snowball (Agichtein+ 2000): [left] entity1 [middle] entity2 [right]
• PMI-IR (Turney 2001): recognize synonyms using Web statistics
• KnowItAll (Etzioni+ 2004) and C-PANKOW (Cimiano+ 2005): is-a relations from Hearst patterns, lists, and PMI

A small sketch of Hearst-pattern matching follows this list.
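A minimal, hypothetical sketch of Hearst-pattern extraction with regular expressions; the pattern set and the extract_isa helper are illustrative, not the original implementation:

```python
import re

# A few Hearst patterns; group "t" is the type, group "x" the instance.
# This list is illustrative -- the 1992 paper defines more variants.
HEARST_PATTERNS = [
    re.compile(r"(?P<t>\w+) such as (?P<x>[A-Z]\w+(?: [A-Z]\w+)*)"),
    re.compile(r"(?P<x>[A-Z]\w+(?: [A-Z]\w+)*) and other (?P<t>\w+)"),
    re.compile(r"(?P<x>[A-Z]\w+(?: [A-Z]\w+)*) is an? (?P<t>\w+)"),
]

def extract_isa(text):
    """Yield (instance, type) pairs matched by any Hearst pattern."""
    for pat in HEARST_PATTERNS:
        for m in pat.finditer(text):
            yield m.group("x"), m.group("t")

print(list(extract_isa("He visited cities such as Karachi.")))
# [('Karachi', 'cities')]
```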


DIPRE and Snowball

Bootstrapping loop: seed tuples → tag their mentions in free text → generate extraction patterns → locate new tuples → augmented table, and repeat.

Example: "… the Irving-based Exxon Corporation …" yields the tuple (location = Irving, organization = Exxon Corporation). Each occurrence is encoded as three bags of words: the left (ℓ), middle (m), and right (r) contexts around the two entities. A sketch of the loop follows.
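A schematic sketch of the DIPRE/Snowball bootstrap; all helper functions are placeholders for the pattern induction and matching machinery described above:

```python
def bootstrap(corpus, seed_tuples, rounds=5, min_conf=0.8):
    """Grow a relation table from seeds by alternating between
    pattern induction and pattern application (DIPRE/Snowball style)."""
    table = set(seed_tuples)
    for _ in range(rounds):
        # 1. Find sentences mentioning both fields of a known tuple.
        occurrences = tag_mentions(corpus, table)          # placeholder
        # 2. Generalize (left, middle, right) contexts into patterns.
        patterns = induce_patterns(occurrences)            # placeholder
        # 3. Apply patterns to the whole corpus to propose new tuples.
        candidates = match_patterns(corpus, patterns)      # placeholder
        # 4. Keep only confidently extracted tuples, then repeat.
        table |= {t for t, conf in candidates if conf >= min_conf}
    return table
```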


Scoring patterns and tuples

Pattern confidence = m+/(m+ + m−) over validation tuples, where m+ and m− count the pattern's correct and incorrect matches.

Soft-or tuple confidence (Snowball, which uses a 5-part occurrence encoding ⟨ℓ, tag1, m, tag2, r⟩):

conf(t) = 1 − ∏_i (1 − conf(p_i) · match(p_i, t))

Recent improvements: urn model (Etzioni+ 2005). A numeric sketch of the soft-or follows.
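A minimal sketch of the soft-or combination; the values are made up for illustration:

```python
def tuple_confidence(pattern_scores):
    """Soft-or: a tuple is wrong only if every supporting pattern is wrong."""
    conf = 1.0
    for p_conf, match in pattern_scores:   # (pattern confidence, match degree)
        conf *= 1.0 - p_conf * match
    return 1.0 - conf

# Two patterns support the tuple with confidences 0.9 and 0.6.
print(tuple_confidence([(0.9, 1.0), (0.6, 1.0)]))  # 0.96
```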


KnowItAll and C-PANKOW

A "propose-validate" approach:
• Using existing patterns, generate search-engine queries
• For each web page w returned, extract a potential fact e and assign it a confidence score
• Add the fact to the database if its score is high enough

Patterns use chunk information.


Exploiting answer types with PMI

From two-word queries to two text boxes, one for the answer type and one for the keywords to match:
• author; "Harry Potter"
• person; "Eiffel Tower"
• director; Swades movie
• city; India Pakistan cricket

The keywords go to a search engine, and every token/chunk in a returned snippet is a candidate answer (modulo elimination hacks that we won't discuss). Hearst-pattern queries are then fired between the desired answer type and each candidate token/chunk.


Information carnivores at work

Example snippet for the query city; India Pakistan cricket:
"KO :: India Pakistan Cricket Series. A web site by Khalid Omar, sort of live from Karachi, Pakistan."

Each candidate token is probed with Hearst-pattern queries such as "cities such as [probe]", "[probe] and other cities", "[probe] is a city", etc.:

Probe    | Word hits | Phrase hits
Khalid   | 1.3M      | 0
Omar     | 6.63M     | 0
sort     | 130M      | 0
Karachi  | 2.51M     | 629
Pakistan | 50.5M     | 1

Pattern matches can still mislead: "Garth Brooks is a country" [singer], "gift such as wall" [clock], "person like Paris" [Hilton], "researchers like Michael Jordan" (which one?). A sketch of the probe score appears below.


Sample output

• author; "Harry Potter": J K Rowling, Ron
• person; "Eiffel Tower": Gustave, (Eiffel), Paris
• director; Swades movie: Ashutosh Gowariker, Ashutosh Gowarikar (variant spellings)

The errors stem from ambiguity and extremely skewed Web popularity. What can search engines do to help?
• Cluster mentions and assign IDs
• Allow queries for IDs (expensive!)
• Recognize the "Harry Potter" context in "Ron is an author"


[The system-architecture figure from earlier is repeated here as a roadmap for the next part of the talk.]


Answer type (atype) prediction

A standard sub-problem in question answering, and increasingly important (but more difficult) for grammar-free Web queries (Broder 2002). Current approaches:
• Pattern matching, e.g., the head of the noun phrase adjacent to what or which; map when, who, where directly to the classes time, person, place (see the toy sketch below)
• Coupled perceptrons (Li and Roth 2002)
• Linear SVM on bag-of-2grams (Hacioglu 2002)
• SVM with tree kernel on the parse (Zhang and Lee 2004): slim gains

Surely a parse tree holds more usable information.
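A toy sketch of the pattern-matching baseline; the mapping table is illustrative:

```python
# Direct wh-word to atype mapping for the easy cases; "what"/"which"
# questions instead need the head noun of the adjacent noun phrase.
WH_ATYPE = {"when": "time", "who": "person", "where": "place"}

def baseline_atype(question_tokens):
    wh = question_tokens[0].lower()
    if wh in WH_ATYPE:
        return WH_ATYPE[wh]
    if wh in ("what", "which"):
        return "head-of-adjacent-NP"   # needs a chunker or parser
    return None

print(baseline_atype("Who invented the television ?".split()))  # person
```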


Informer span

A short, contiguous span of question tokens reveals the anticipated answer type (atype). Except in multi-function questions, one informer span is dominant and sufficient:
• What is the weight of a rhino?
• How much does a rhino weigh?
• How much does a rhino cost?
• Who is the CEO of IBM?

Pipeline: question → parse → informer span tagger; then learn the atype label from the informer plus the question. A small tagging sketch follows.
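A minimal sketch of informer tagging as pre/in/post sequence labeling, using the third-party sklearn-crfsuite package; the feature choices are simplified stand-ins for the parse-derived features on the next slide:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, pos_tags, i):
    """Per-token features; the real system adds parse attachments and levels."""
    return {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "is_first": i == 0,
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
    }

# One toy training question, labeled with the pre/in/post states.
tokens = "What is the capital city of Japan".split()
pos    = ["WP", "VBZ", "DT", "NN", "NN", "IN", "NNP"]
labels = ["pre", "pre", "pre", "in", "in", "post", "post"]

X = [[token_features(tokens, pos, i) for i in range(len(tokens))]]
y = [labels]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])
```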


Example

A pre-in-post Markov process is assumed to produce the question: tokens before the informer are in the "pre" state, informer tokens in the "in" state, and the rest in the "post" state. Train a CRF with features derived from the parse tree:
• POS tags and attachments to neighboring chunks, at multiple levels
• Is this the first noun chunk? Is it adjacent to the second verb?

[Figure: parse tree of "What is the capital city of Japan" (token positions 0 to 6; POS tags WP VBZ DT NN NN IN NNP; constituents WHNP, NP, PP, VP, SQ, SBARQ), segmented as pre = "What, is, the", informer = "capital, city", post = "of, Japan".]


Atype guessing accuracy

wh-word | #Questions | 2gram SVM | 2gram+Perfect | 2gram+Heuristic | 2gram+CRF
what      | 349 | 73.6 | 85.1 | 79.1 | 83.1
which     | 11  | 81.8 | 90.9 | 54.5 | 81.8
when      | 28  | 100  | 100  | 100  | 100
where     | 27  | 92.6 | 88.9 | 92.6 | 88.9
who       | 47  | 97.9 | 100  | 100  | 97.9
how*      | 32  | 87.5 | 87.5 | 90.6 | 90.6
rest      | 6   | 66.7 | 100  | 66.7 | 66.7
Aggregate |     | 79.4 | 88   | 82.6 | 86.2

[Figure: prediction pipeline. The question goes to the trained CRF; its output is filtered and fed to an informer feature generator, while an ordinary feature generator runs in parallel. The merged feature vector goes to a linear SVM, which outputs the atype.]


[The system-architecture figure is repeated again as a roadmap for the final part of the talk.]


Scoring function for typed search

We want an instance of the atype "near" the keyword matches. Existing systems handle proximity differently:
• IR systems: "hard" proximity predicates
• Search engines: unknown reward for proximity
• XML+IR, XRank: "hard" word containment in a subtree

[Figure: annotated snippet "Television was invented in 1925. Inventor John Baird was born ...". Tokens that are IS-A instances of person#n#1 are candidates; the selector matches are invent* and television; the correct candidate need not be the token closest to the selectors.]

Question: Who invented the television?
Atype: person#n#1
Selectors: invent*, television

score(a) = Σ_i energy(s_i) · decay(gap(s_i, a)), summed over selector matches s_i up to some maximum window around candidate a. A sketch follows.
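A minimal sketch of this proximity score, assuming an exponential decay and treating selector energies as given; all values are illustrative:

```python
def score(candidate_pos, selector_matches, energy, d=0.9, window=40):
    """Sum decayed selector energies within a maximum window.

    selector_matches: list of (selector, position) pairs in the snippet.
    energy: dict selector -> weight (e.g., IDF).
    decay(gap) = d ** gap, truncated at the window boundary.
    """
    total = 0.0
    for sel, pos in selector_matches:
        gap = abs(pos - candidate_pos)
        if gap <= window:
            total += energy[sel] * (d ** gap)
    return total

matches = [("invent*", 2), ("television", 0)]
energy = {"invent*": 5.0, "television": 4.0}
print(score(6, matches, energy))  # score a candidate at token offset 6
```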


Learning a scoring function

score(a) = Σ_s IDF(s) · max_i decay(gap(s_i, a)), where s_i ranges over occurrences of selector s.

Assume a parametric form for a ranking classifier:
• the form of the IDF term, the window size w, the decay base d
• can also choose among decay function forms

Question-answer pairs give partial orders over candidates (Joachims 2004). Evaluation: recall in the top k = 50 and mean reciprocal rank (MRR).

d (w=40, k=50) | Recall | MRR
0.7 | 0.7  | 0.37
0.8 | 0.77 | 0.4
0.9 | 0.87 | 0.44

w (d=0.9, k=50) | Recall | MRR
20 | 0.83 | 0.4
30 | 0.87 | 0.42
40 | 0.87 | 0.44

IDF form | Recall | MRR
lin | 0.68 | 0.32
log | 0.81 | 0.4

A pairwise-ranking sketch follows.
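A minimal sketch of learning from partial orders as pairwise classification, in the spirit of a ranking SVM (Joachims); the feature construction and data are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Each candidate is a feature vector (e.g., decayed selector energies at
# several window sizes). A preference "good beats bad" becomes a positive
# example on the difference vector, and the reversed pair a negative one.
def pairwise_data(preferences):
    X, y = [], []
    for good, bad in preferences:
        X.append(np.asarray(good) - np.asarray(bad)); y.append(1)
        X.append(np.asarray(bad) - np.asarray(good)); y.append(-1)
    return np.array(X), np.array(y)

prefs = [([3.2, 1.0], [0.5, 0.8]), ([2.1, 0.9], [1.9, 0.2])]
X, y = pairwise_data(prefs)
w = LinearSVC(fit_intercept=False).fit(X, y).coef_[0]
print("learned weights:", w)   # rank candidates by w . x
```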


Indexing issues

A standard IR posting maps word → {(doc, offsets)}:
• word1 near word2 is standard
• instance-of(atype) near {word1, word2, …} is not

WordNet has 80,000 atype nodes, 17,000 of them internal, with depth > 10:
• "horse" must also be indexed as mammal, animal, sports equipment, chess piece, …
• Original corpus 4GB; gzipped corpus 1.3GB; IR index 0.9GB; full atype index 4.3GB

XML structure indices are not designed for fine-grain, word-as-element-node use. A hypernym-expansion sketch follows.
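A small sketch of hypernym expansion at indexing time using NLTK's WordNet interface; the posting-list structure is a stand-in:

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn  # requires the nltk_data wordnet corpus

atype_index = defaultdict(list)  # synset name -> [(doc, offset), ...]

def index_token(token, doc, offset):
    """Post the token under every hypernym of every noun sense it can take."""
    for synset in wn.synsets(token, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for ancestor in path:
                atype_index[ancestor.name()].append((doc, offset))

index_token("horse", doc=7, offset=42)
print(len(atype_index))  # ancestors include animal.n.01, artifact.n.01, ...
```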


Exploit skew in query atypes?

Atype        | Count | Atype      | Count
integer#n#1  | 100   | author#n#1 | 7
location#n#1 | 78    | state#n#2  | 6
person#n#1   | 77    | number#n#1 | 6
city#n#1     | 20    | date#n#1   | 6
name#n#1     | 10    | actor#n#1  | 6
company#n#1  | 7     | movie#n#1  | 5

• Index only a small registered set of atypes R
• Relax the query atype a to a generalization g in R
• Test each response for reachability from a and retain/discard

How to pick R? What is a good objective?
• Relaxed queries and the discarding step cost extra time
• Rare atypes in what, which, and name questions form a long-tailed distribution

A sketch of the relax-then-filter step follows.
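A schematic sketch of query-time relaxation and post-filtering; the taxonomy and corpus helpers are placeholders:

```python
def answer_positions(query_atype, registered, atype_index, is_instance_of):
    """Relax to the cheapest indexed generalization, then filter."""
    # Walk up the taxonomy until we reach an atype that is in R.
    g = query_atype
    while g not in registered:
        g = parent(g)                      # placeholder taxonomy walk
    # Scan the (larger) posting list of g, discarding non-instances of a.
    for doc, offset in atype_index[g]:
        token = token_at(doc, offset)      # placeholder corpus access
        if is_instance_of(token, query_atype):
            yield doc, offset
```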


Approximate objective and approach

Index space is approximately Σ_{a ∈ R} corpusCount(a).

Expected query time bloat is approximately
Σ_{a ∈ T} queryProb(a) · corpusCount(g(a)) / corpusCount(a),
where g(a) = argmin_{g ∈ R, g a generalization of a} corpusCount(g).

Minimize the approximate index space subject to an upper bound on the bloat (hard, as expected).

Sparseness: queryProb(a) is observed to be zero for most atypes a in a large taxonomy, so smooth it using similarity between atypes. A sketch of these estimates follows.
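A minimal sketch of evaluating a candidate registered set R under these approximations; the counts, probabilities, and the generalizations helper are illustrative, and the taxonomy root is assumed to always be in R:

```python
def index_space(R, corpus_count):
    """Approximate index space: sum of posting-list sizes of registered atypes."""
    return sum(corpus_count[a] for a in R)

def expected_bloat(R, taxonomy_atypes, query_prob, corpus_count, generalizations):
    """Average posting-list inflation from relaxing each queried atype a
    to its cheapest generalization g(a) in R."""
    bloat = 0.0
    for a in taxonomy_atypes:
        g = min((g for g in generalizations(a) if g in R),
                key=lambda g: corpus_count[g])  # root assumed to be in R
        bloat += query_prob(a) * corpus_count[g] / corpus_count[a]
    return bloat
```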


Sample results

The index space approximation is reasonable, and the average query time bloat stays small for modest index space overheads.

|R| | Average bloat
30  | 5.16
47  | 1.7
All | 1

|R| | Index size
9   | 409M
13  | 498M
29  | 911M
All | 4300M

[Figure: per-query run times using the relaxed atype g versus the original atype a.]


Summary

Entity and relation annotators:
• Maturing technology
• Unlikely to be perfect for open-domain sources

The future: query paradigms that combine text and annotations:
• End-user friendly selection and aggregation
• Allow uncertainty, exploit redundancy

Open questions: Can we scale to terabytes of text? Will centralized search engines be feasible? How do we federate annotation management?