
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Panos Ipeirotis – New York University
Eugene Agichtein – Microsoft Research
Pranay Jain – Columbia University
Luis Gravano – Columbia University

2

Text-Centric Task I: Information Extraction

Information extraction applications extract structured relations from unstructured text.

"May 19, 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  UK
Feb. 1995  Pneumonia        US
May 1995   Ebola            Zaire

Information Extraction System (e.g., NYU's Proteus)

Disease Outbreaks in The New York Times

Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.

3

Text-Centric Task II: Metasearching

Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately.

"Friday, June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list of CSPI, a consumer-health group that this week sued the operator of the KFC restaurant chain."

Word       Frequency
Starbucks  102
consumer   215
soccer     1,295
…          …

Content Summary Extractor

Word       Frequency
Starbucks  103
consumer   216
soccer     1,295
…          …

Content Summary of Forbes.com

4

Text-Centric Task III: Focused Resource Discovery

Identify web pages about a given topic (multiple techniques proposed: simple classifiers, focused crawlers, focused querying…).

URL
http://biology.about.com
http://www.amjbot.org
http://www.sysbot.org
http://www.botany.ubc.ca

[Diagram: Web Page → Classifier → Web Pages about Botany]

5

An Abstract View of Text-Centric Tasks

[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Task                    Token
Information Extraction  Relation Tuple
Database Selection      Word (+Frequency)
Focused Crawling        Web Page about a Topic

For the rest of the talk

6

Executing a Text-Centric Task

[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Similar to the relational world: two major execution paradigms.
- Scan-based: retrieve and process documents sequentially.
- Index-based: query database (e.g., [case fatality rate]); retrieve and process documents in the results.

Unlike the relational world:
- indexes are only "approximate": the index is on keywords, not on tokens of interest;
- the choice of execution plan affects output completeness (not only speed).

→ The underlying data distribution dictates what is best.

7

Execution Plan Characteristics

[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens

Execution plans have two main characteristics: execution time and recall (the fraction of tokens retrieved).

Question: How do we choose the fastest execution plan for reaching a target recall?

"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"

8

Outline

Description and analysis of crawl- and query-based plans:
- Scan (crawl-based)
- Filtered Scan (crawl-based)
- Iterative Set Expansion (query-based / index-based)
- Automatic Query Generation (query-based / index-based)

Optimization strategy

Experimental results and conclusions

9

Scan

[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Process documents
3. Extract output tokens

Scan retrieves and processes documents sequentially (until reaching target recall).

Execution time = |Retrieved Docs| · (R + P),
where R is the time for retrieving a document and P is the time for processing a document.

Question: How many documents does Scan retrieve to reach target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in paper).

10

Estimating Recall of Scan

Modeling Scan for token t: What is the probability of seeing t (with frequency g(t)) after retrieving S documents? A "sampling without replacement" process:
- After retrieving S documents, the frequency of token t follows a hypergeometric distribution.
- Recall for token t is the probability that the frequency of t in the S docs is > 0.

[Diagram: sampling for token t (e.g., <SARS, China>) from database D = {d1, d2, …, dS, …, dN}; S documents retrieved. Shown: probability of seeing token t after retrieving S documents; g(t) = frequency of token t.]
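In formula form (a restatement of the slide's hypergeometric model, with |D| the database size):

recall(t, S) = \Pr[X_t > 0] = 1 - \binom{|D| - g(t)}{S} \Big/ \binom{|D|}{S}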

11

Estimating Recall of Scan

Modeling Scan: multiple "sampling without replacement" processes, one for each token. Overall recall is the average recall across tokens.

→ We can compute the number of documents required to reach a target recall.

[Diagram: parallel sampling for tokens t1 (<SARS, China>), t2 (<Ebola, Zaire>), …, tM from database D = {d1, d2, d3, …, dN}]

Execution time = |Retrieved Docs| · (R + P)
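A minimal sketch of that computation (hypothetical helper functions, not the paper's code):

```python
from math import comb

def scan_recall(num_docs, token_freqs, sample_size):
    """Expected Scan recall after retrieving sample_size of num_docs documents:
    average over tokens of 1 - C(|D| - g(t), S) / C(|D|, S)."""
    denom = comb(num_docs, sample_size)
    return sum(1 - comb(num_docs - g, sample_size) / denom
               for g in token_freqs) / len(token_freqs)

def docs_needed(num_docs, token_freqs, target_recall):
    """Smallest sample size whose expected recall meets the target
    (recall is monotone in S, so a binary search would also work)."""
    for s in range(num_docs + 1):
        if scan_recall(num_docs, token_freqs, s) >= target_recall:
            return s
    return None
```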

12

Scan and Filtered Scan

[Diagram: Text Database → Classifier → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Filter documents
3. Process documents
4. Extract output tokens

Scan retrieves and processes all documents (until reaching target recall).

Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks).

Execution time = |Retrieved Docs| · (R + F + σ·P),
where R is the time for retrieving a document, F the time for filtering a document, P the time for processing a document, and σ ≤ 1 the classifier selectivity (only the filtered fraction is processed).

Question: How many documents does (Filtered) Scan retrieve to reach target recall?

13

Estimating Recall of Filtered Scan

Modeling Filtered Scan: analysis similar to Scan. Main difference: the classifier rejects documents, and
- decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity), since documents rejected by the classifier shrink the effective database;
- decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall), since tokens in rejected documents have lower effective frequency.

[Diagram: parallel sampling for tokens t1, t2, …, tM from database D = {d1, d2, d3, …, dN}]
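Under this model, a Filtered Scan variant just shrinks the inputs to the scan_recall sketch above (σ and r as defined on the slide; again a sketch, not the paper's code):

```python
def filtered_scan_recall(num_docs, token_freqs, sample_size, sigma, r):
    """Filtered Scan recall per the slide's model: the classifier shrinks the
    effective database to sigma*|D| and token frequencies to r*g(t)."""
    eff_docs = max(1, round(sigma * num_docs))
    eff_freqs = [min(round(r * g), eff_docs) for g in token_freqs]
    return scan_recall(eff_docs, eff_freqs, min(sample_size, eff_docs))
```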

14

Outline

Description and analysis of crawl- and query-based plans:
- Scan (crawl-based)
- Filtered Scan (crawl-based)
- Iterative Set Expansion (query-based)
- Automatic Query Generation (query-based)

Optimization strategy

Experimental results and conclusions

Iterative Set Expansion

[Diagram: Text Database → Extraction System → Output Tokens → Query Generation → back to Text Database]
1. Query database with seed tokens (e.g., <Malaria, Ethiopia>)
2. Process retrieved documents
3. Extract tokens from docs
4. Augment seed tokens with new tokens (tokens become keyword queries, e.g., [Ebola AND Zaire])

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q,
where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.

Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents:
- each token (transformed to a keyword query) retrieves documents;
- documents contain tokens.

[Bipartite graph: tokens t1 (<SARS, China>), t2 (<Ebola, Zaire>), t3 (<Malaria, Ethiopia>), t4 (<Cholera, Sudan>), t5 (<H5N1, Vietnam>) linked to documents d1–d5]

17

Using Querying Graph for Analysis

We need to compute:
- the number of documents retrieved after sending Q tokens as queries (estimates time);
- the number of tokens that appear in the retrieved documents (estimates recall).

To estimate these, we need to compute:
- the degree distribution of the tokens discovered by retrieving documents;
- the degree distribution of the documents retrieved by the tokens.
(Not the same as the degree distribution of a randomly chosen token or document – it is easier to discover documents and tokens with high degrees.)

Elegant analysis framework based on generating functions – details in the paper.

[Bipartite graph: tokens t1–t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) linked to documents d1–d5]

18

Recall Limit: Reachability Graph

t1 retrieves document d1, which contains t2.

[Reachability graph over tokens t1–t5, derived from the bipartite querying graph of tokens t1 (<SARS, China>), t2 (<Ebola, Zaire>), t3 (<Malaria, Ethiopia>), t4 (<Cholera, Sudan>), t5 (<H5N1, Vietnam>) and documents d1–d5]

Upper recall limit: determined by the size of the biggest connected component.
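The recall ceiling can be checked directly on a querying graph: only tokens reachable from the seeds by alternating token → document → token hops can ever be discovered. A small self-contained sketch over a toy graph (not the paper's implementation):

```python
from collections import deque

def reachable_tokens(token_to_docs, doc_to_tokens, seeds):
    """Tokens discoverable by Iterative Set Expansion: querying with a token
    retrieves its documents, whose tokens become new queries (BFS)."""
    seen, frontier = set(seeds), deque(seeds)
    while frontier:
        token = frontier.popleft()
        for doc in token_to_docs.get(token, ()):
            for other in doc_to_tokens.get(doc, ()):
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return seen

# Toy graph mirroring the slide: t1's query retrieves d1, which contains t2,
# but t4 and t5 live in a separate component, capping recall at 3/5.
token_to_docs = {"t1": ["d1"], "t2": ["d2"], "t3": ["d2"], "t4": ["d4"], "t5": ["d4"]}
doc_to_tokens = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d4": ["t4", "t5"]}
print(reachable_tokens(token_to_docs, doc_to_tokens, {"t1"}))  # {'t1', 't2', 't3'}
```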

19

Automatic Query Generation

Iterative Set Expansion has a recall limitation, due to the iterative nature of its query generation.

Automatic Query Generation avoids this problem by creating queries offline (using machine learning), designed to return documents with tokens.

20

Automatic Query Generation

[Diagram: Offline Query Generation → Text Database → Extraction System → Output Tokens]
1. Generate queries that tend to retrieve documents with tokens (offline)
2. Query database
3. Process retrieved documents
4. Extract tokens from docs

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q,
where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs and has precision p(q):
- p(q)·g(q) useful docs, [1 − p(q)]·g(q) useless docs.

We compute the total number of useful (and useless) documents retrieved.

Analysis similar to Filtered Scan:
- effective database size is |D_useful|;
- sample size S is the number of useful documents retrieved.

[Diagram: query q splits the Text Database into p(q)·g(q) useful docs and (1 − p(q))·g(q) useless docs]
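A tiny sketch of the bookkeeping, with hypothetical per-query values:

```python
def useful_useless(queries):
    """Totals of useful and useless documents over a query set, from each
    query's result size g(q) and precision p(q) (hypothetical values)."""
    useful = sum(p * g for g, p in queries)
    useless = sum((1 - p) * g for g, p in queries)
    return useful, useless

# e.g., two queries returning 100 and 50 docs at precision 0.8 and 0.4:
print(useful_useless([(100, 0.8), (50, 0.4)]))  # (100.0, 50.0)
```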

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far:
- takes as input a target recall;
- gives as output the time for each plan to reach the target recall (time = infinity if a plan cannot reach the target recall).

Time and recall depend on task-specific properties of the database:
- token degree distribution;
- document degree distribution.

Next, we show how to estimate the degree distributions on the fly.

24

Estimating Cost Model Parameters

Token and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters.

Task                          Document Distribution  Token Distribution
Information Extraction        Power-law              Power-law
Content Summary Construction  Lognormal              Power-law (Zipf)
Focused Resource Discovery    Uniform                Uniform

[Log-log plots: Number of Documents vs. Document Degree, trendline y = 43060·x^(−3.3863); Number of Tokens vs. Token Degree, trendline y = 54922·x^(−2.0254)]
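The trendlines above are ordinary power-law fits; a standard way to recover such parameters is least squares in log-log space (a sketch under that assumption, not the paper's estimator, which also corrects for biased sampling):

```python
import numpy as np

def fit_power_law(degrees, counts):
    """Fit counts ≈ a * degree**b by linear regression on log-log data,
    as in the trendlines above (e.g., a ≈ 54922, b ≈ -2.0254 for tokens)."""
    slope, intercept = np.polyfit(np.log(degrees), np.log(counts), 1)
    return np.exp(intercept), slope  # (a, b)
```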

25

Parameter Estimation

Naïve solution for parameter estimation:
- Start with a separate "parameter-estimation" phase.
- Perform random sampling on the database.
- Stop when cross-validation indicates high confidence.

We can do better than this. There is no need for a separate sampling phase: sampling is equivalent to executing the task.

→ Piggyback parameter estimation onto execution.

26

On-the-fly Parameter Estimation

- Pick the most promising execution plan for the target recall, assuming "default" parameter values.
- Start executing the task.
- Update parameter estimates during execution.
- Switch plans if the updated statistics indicate so.

Important: Only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).

[Plot: correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]
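A sketch of this control loop (the plan objects and helpers are hypothetical; the real estimators are in the paper):

```python
def run_optimized(plans, target_recall, params, batch_size=100):
    """Pick the plan predicted to be fastest for the target recall, execute a
    batch, refresh the parameter estimates, and reconsider the choice."""
    recall_so_far = 0.0
    while recall_so_far < target_recall:
        plan = min(plans, key=lambda p: p.predicted_time(target_recall, params))
        stats = plan.execute_batch(batch_size)          # retrieve/process some docs
        params = update_estimates(params, stats, plan)  # de-bias non-Scan samples
        recall_so_far = stats.estimated_recall
```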

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines: actual time. Dotted lines: predicted time with correct parameters.

Task: Disease Outbreaks
- Snowball IE system
- 182,531 documents from NYT
- 16,921 tokens

[Plot: Execution Time (secs), 100–100,000 (log scale), vs. Recall, 0.00–1.00, for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion]

29

Experimental Results (Information Extraction)

Solid lines: actual time. Green line: time with the optimizer.
(Results similar in other experiments – see paper.)

[Plot: Execution Time (secs), 100–100,000 (log scale), vs. Recall, 0.00–1.00, for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of the extraction system into the framework

Create non-parametric optimization (i.e., no assumptions about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create an adaptive "next-K" optimizer

32

Thank you

Task                          Filtered Scan                             Iterative Set Expansion           Automatic Query Generation
Information Extraction        Grishman et al., J. of Biomed. Inf. 2002  Agichtein and Gravano, ICDE 2003  Agichtein and Gravano, ICDE 2003
Content Summary Construction  -                                         Callan et al., SIGMOD 1999        Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery    Chakrabarti et al., WWW 1999              -                                 Cohen and Singer, AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task: Company Headquarters
- Snowball IE system
- 182,531 documents from NYT
- 16,921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction
- 19,997 documents from 20newsgroups
- 120,024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction
- 19,997 documents from 20newsgroups
- 120,024 tokens

ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall.

37

Experimental Results (Content Summaries)

Content Summary Extraction
- 19,997 documents from 20newsgroups
- 120,024 tokens

Underestimated recall for AQG; switched to ISE.

38

Experimental Results (Information Extraction)

[Plot: Execution Time (secs), 100–100,000 (log scale), vs. Recall, 0.00–1.00, for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]

OPTIMIZED is faster than the "best plan": it overestimated FS recall, but after FS ran to completion, OPTIMIZED just switched to Scan.

39

Focused Resource Discovery
- 800,000 web pages
- 12,000 tokens


Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 6: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

6

Executing a Text-Centric TaskOutput Tokens

hellipExtraction

System

Text Database

3 Extract output tokens

2 Process documents

1 Retrieve documents from database

Similar to relational world

Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results

Unlike the relational world

Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)

rarrunderlying data distribution dictates what is best

7

Execution Plan CharacteristicsOutput Tokens

hellipExtraction

System

Text Database

3 Extract output tokens2 Process documents1 Retrieve documents from database

Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)

Question How do we choose the fastest execution plan for reaching

a target recall

ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo

8

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based(Index-based)

9

ScanOutput Tokens

hellipExtraction

System

Text Database

3 Extract output tokens

2 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| middot (R + P)

Time for retrieving a document

Question How many documents does Scan retrieve

to reach target recall

Time for processing a document

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)

10

Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with

frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process

After retrieving S documents frequency of token t follows hypergeometric distribution

Recall for token t is the probability that frequency of t in S docs gt 0

t

d1

d2

dS

dN

D

Token

Samplingfor t

ltSARS Chinagt

S documents

Probability of seeing token t after retrieving S

documentsg(t) = frequency of token t

11

Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo

processes one for each token Overall recall is average recall across

tokens

rarr We can compute number of documents required to reach target recall

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Samplingfor tM

ltSARS Chinagt

ltEbola Zairegt

Execution time = |Retrieved Docs| middot (R + P)

12

Scan and Filtered ScanOutput Tokens

hellipExtraction

System

Text Database

4 Extract output tokens

3 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes all documents (until reaching target recall)

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)

Execution time = |Retrieved Docs| ( R + F + P)

Time for retrieving a document

Time for filteringa document

Question How many documents does (Filtered) Scan

retrieve to reach target recall

Classifier

2 Filter documents

Time for processing a document

Classifier selectivity (σle1)

σ

filtered

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 7: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

7

Execution Plan CharacteristicsOutput Tokens

hellipExtraction

System

Text Database

3 Extract output tokens2 Process documents1 Retrieve documents from database

Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)

Question How do we choose the fastest execution plan for reaching

a target recall

ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo

8

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based(Index-based)

9

ScanOutput Tokens

hellipExtraction

System

Text Database

3 Extract output tokens

2 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| middot (R + P)

Time for retrieving a document

Question How many documents does Scan retrieve

to reach target recall

Time for processing a document

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)

10

Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with

frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process

After retrieving S documents frequency of token t follows hypergeometric distribution

Recall for token t is the probability that frequency of t in S docs gt 0

t

d1

d2

dS

dN

D

Token

Samplingfor t

ltSARS Chinagt

S documents

Probability of seeing token t after retrieving S

documentsg(t) = frequency of token t

11

Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo

processes one for each token Overall recall is average recall across

tokens

rarr We can compute number of documents required to reach target recall

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Samplingfor tM

ltSARS Chinagt

ltEbola Zairegt

Execution time = |Retrieved Docs| middot (R + P)

12

Scan and Filtered ScanOutput Tokens

hellipExtraction

System

Text Database

4 Extract output tokens

3 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes all documents (until reaching target recall)

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)

Execution time = |Retrieved Docs| ( R + F + P)

Time for retrieving a document

Time for filteringa document

Question How many documents does (Filtered) Scan

retrieve to reach target recall

Classifier

2 Filter documents

Time for processing a document

Classifier selectivity (σle1)

σ

filtered

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 8: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

8

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based(Index-based)

9

ScanOutput Tokens

hellipExtraction

System

Text Database

3 Extract output tokens

2 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| middot (R + P)

Time for retrieving a document

Question How many documents does Scan retrieve

to reach target recall

Time for processing a document

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)

10

Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with

frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process

After retrieving S documents frequency of token t follows hypergeometric distribution

Recall for token t is the probability that frequency of t in S docs gt 0

t

d1

d2

dS

dN

D

Token

Samplingfor t

ltSARS Chinagt

S documents

Probability of seeing token t after retrieving S

documentsg(t) = frequency of token t

11

Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo

processes one for each token Overall recall is average recall across

tokens

rarr We can compute number of documents required to reach target recall

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Samplingfor tM

ltSARS Chinagt

ltEbola Zairegt

Execution time = |Retrieved Docs| middot (R + P)

12

Scan and Filtered ScanOutput Tokens

hellipExtraction

System

Text Database

4 Extract output tokens

3 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes all documents (until reaching target recall)

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)

Execution time = |Retrieved Docs| ( R + F + P)

Time for retrieving a document

Time for filteringa document

Question How many documents does (Filtered) Scan

retrieve to reach target recall

Classifier

2 Filter documents

Time for processing a document

Classifier selectivity (σle1)

σ

filtered

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 9: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

9

ScanOutput Tokens

hellipExtraction

System

Text Database

3 Extract output tokens

2 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| middot (R + P)

Time for retrieving a document

Question How many documents does Scan retrieve

to reach target recall

Time for processing a document

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)

10

Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with

frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process

After retrieving S documents frequency of token t follows hypergeometric distribution

Recall for token t is the probability that frequency of t in S docs gt 0

t

d1

d2

dS

dN

D

Token

Samplingfor t

ltSARS Chinagt

S documents

Probability of seeing token t after retrieving S

documentsg(t) = frequency of token t

11

Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo

processes one for each token Overall recall is average recall across

tokens

rarr We can compute number of documents required to reach target recall

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Samplingfor tM

ltSARS Chinagt

ltEbola Zairegt

Execution time = |Retrieved Docs| middot (R + P)

12

Scan and Filtered ScanOutput Tokens

hellipExtraction

System

Text Database

4 Extract output tokens

3 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes all documents (until reaching target recall)

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)

Execution time = |Retrieved Docs| ( R + F + P)

Time for retrieving a document

Time for filteringa document

Question How many documents does (Filtered) Scan

retrieve to reach target recall

Classifier

2 Filter documents

Time for processing a document

Classifier selectivity (σle1)

σ

filtered

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task                         | Filtered Scan                            | Iterative Set Expansion          | Automatic Query Generation
Information Extraction       | Grishman et al., J. of Biomed. Inf. 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | -                                        | Callan et al., SIGMOD 1999       | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery   | Chakrabarti et al., WWW 1999             | -                                | Cohen and Singer, AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task: Company Headquarters

Snowball IE system

182,531 documents from NYT

16,921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19,997 documents from 20newsgroups

120,024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19,997 documents from 20newsgroups

120,024 tokens

ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall.

37

Experimental Results (Content Summaries)

Content Summary Extraction

19,997 documents from 20newsgroups

120,024 tokens

The optimizer underestimated the recall of AQG and switched to ISE.

38

Experimental Results (Information Extraction)

[Plot: execution time (secs, log scale) vs. recall for Scan, Filt. Scan, Iterative Set Expansion, Automatic Query Gen, and OPTIMIZED]

OPTIMIZED is faster than the "best plan": the optimizer overestimated FS recall, but after FS ran to completion, OPTIMIZED just switched to Scan.

39

Focused Resource Discovery

Focused Resource Discovery

800,000 web pages

12,000 tokens

Page 10: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

10

Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with

frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process

After retrieving S documents frequency of token t follows hypergeometric distribution

Recall for token t is the probability that frequency of t in S docs gt 0

t

d1

d2

dS

dN

D

Token

Samplingfor t

ltSARS Chinagt

S documents

Probability of seeing token t after retrieving S

documentsg(t) = frequency of token t

11

Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo

processes one for each token Overall recall is average recall across

tokens

rarr We can compute number of documents required to reach target recall

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Samplingfor tM

ltSARS Chinagt

ltEbola Zairegt

Execution time = |Retrieved Docs| middot (R + P)

12

Scan and Filtered ScanOutput Tokens

hellipExtraction

System

Text Database

4 Extract output tokens

3 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes all documents (until reaching target recall)

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)

Execution time = |Retrieved Docs| ( R + F + P)

Time for retrieving a document

Time for filteringa document

Question How many documents does (Filtered) Scan

retrieve to reach target recall

Classifier

2 Filter documents

Time for processing a document

Classifier selectivity (σle1)

σ

filtered

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 11: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

11

Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo

processes one for each token Overall recall is average recall across

tokens

rarr We can compute number of documents required to reach target recall

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Samplingfor tM

ltSARS Chinagt

ltEbola Zairegt

Execution time = |Retrieved Docs| middot (R + P)

12

Scan and Filtered ScanOutput Tokens

hellipExtraction

System

Text Database

4 Extract output tokens

3 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes all documents (until reaching target recall)

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)

Execution time = |Retrieved Docs| ( R + F + P)

Time for retrieving a document

Time for filteringa document

Question How many documents does (Filtered) Scan

retrieve to reach target recall

Classifier

2 Filter documents

Time for processing a document

Classifier selectivity (σle1)

σ

filtered

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 12: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

12

Scan and Filtered ScanOutput Tokens

hellipExtraction

System

Text Database

4 Extract output tokens

3 Process documents

1 Retrieve docs from database

ScanScan retrieves and processes all documents (until reaching target recall)

Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)

Execution time = |Retrieved Docs| ( R + F + P)

Time for retrieving a document

Time for filteringa document

Question How many documents does (Filtered) Scan

retrieve to reach target recall

Classifier

2 Filter documents

Time for processing a document

Classifier selectivity (σle1)

σ

filtered

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 13: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

13

Estimating Recall of Filtered ScanModeling Filtered Scan

Analysis similar to Scan Main difference the classifier rejects

documents and Decreases effective database size

from |D| to σ|D| (σ classifier selectivity)

Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)

t1 t2 tM

d1

d2

d3

dN

D

Tokens

Samplingfor t1

Samplingfor t2

Sampling

for tM

Documents rejected by classifier decrease effective

database size

Tokens in rejected documents have lower

effective token frequency

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 14: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

14

Outline

Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation

Optimization strategy

Experimental results and conclusions

Crawl-based

Query-based

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

Underestimated recall for AQG switched to ISE

38

Experimental Results (Information Extraction)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s) Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

OPTIMIZED is faster than ldquobest planrdquo overestimated

FS recall but after FS run to completion OPTIMIZED

just switched to Scan

39

Focused Resource Discovery

Focused Resource Discovery

800000 web pages

12000 tokens

Page 15: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

15

Iterative Set ExpansionOutput Tokens

hellipExtraction

System

Text Database

3 Extract tokensfrom docs

2 Process retrieved documents

1 Query database with seed tokens

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

Time for processing a document

Query

Generation

4 Augment seed tokens with new tokens

Question How many queries and how many documents

does Iterative Set Expansion need to reach target recall

(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)

16

Querying Graph

The querying graph is a bipartite graph containing tokens and documents

Each token (transformed to a keyword query) retrieves documents

Documents contain tokens

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

17

Using Querying Graph for Analysis

We need to compute the Number of documents retrieved after

sending Q tokens as queries (estimates time) Number of tokens that appear in the

retrieved documents (estimates recall)

To estimate these we need to compute the Degree distribution of the tokens

discovered by retrieving documents Degree distribution of the documents

retrieved by the tokens (Not the same as the degree distribution of a

randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Elegant analysis framework based on generating functions ndash details in the paper

ltSARS Chinagt

ltEbola Zairegt

ltMalaria Ethiopiagt

ltCholera Sudangt

ltH5N1 Vietnamgt

18

Recall Limit Reachability Graph

t1 retrieves document d1 that contains t2

t1

t2 t3

t4t5

Tokens

Documents

t1

t2

t3

t4

t5

d1

d2

d3

d4

d5

Upper recall limit determined by the size of the biggest connected component

Reachability Graph

19

Automatic Query Generation

Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation

Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens

20

Automatic Query GenerationOutput Tokens

hellipExtraction

System

Text Database

4 Extract tokensfrom docs

3 Process retrieved documents

2 Query database

Execution time = |Retrieved Docs| (R + P) + |Queries| Q

Time for retrieving a document

Time for answering a query

Time for processing a document

OfflineQuery

Generation

1 Generate queries that tend to retrieve documents with tokens

21

Estimating Recall of Automatic Query Generation

Query q retrieves g(q) docs Query has precision p(q)

p(q)g(q) useful docs [1-p(q)]g(q) useless docs

We compute total number of useful (and useless) documents retrieved

Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful

documents retrieved

Text Database

Useful Doc

Useless Doc

q p(q)g(q)

(1-p(q))g(q)

22

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

23

Summary of Cost Analysis

Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall

(time = infinity if plan cannot reach target recall)

Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution

Next we show how to estimate degree distributions on-the-fly

24

Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families

Can characterize distributions with only a few parameters

Task Document Distribution Token Distribution

Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)

Focused Resource Discovery Uniform Uniform

y = 43060x-33863

1

10

100

1000

10000

100000

1 10 100Document Degree

Nu

mb

er

of

Do

cum

en

ts

y = 54922x-20254

1

10

100

1000

10000

1 10 100 1000Token Degree

Num

ber

of T

oken

s

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

ISE is a cheap plan for low target recall

but becomes the most expensive for high

target recall

37

Experimental Results (Content Summaries)

Content Summary Extraction

19,997 documents from 20newsgroups

120,024 tokens

Underestimated recall for AQG; switched to ISE.

38

Experimental Results (Information Extraction)

[Figure: Execution Time (secs) vs. Recall (0.00–1.00), log scale, for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED.]

OPTIMIZED is faster than the "best plan": it overestimated FS recall, but after FS ran to completion, OPTIMIZED simply switched to Scan.

39

Focused Resource Discovery

800,000 web pages

12,000 tokens

Page 25: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks Panos Ipeirotis – New York University Eugene Agichtein – Microsoft Research Pranay

25

Parameter Estimation

Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence

We can do better than this

No need for separate sampling phase Sampling is equivalent to executing the task

rarrPiggyback parameter estimation into execution

26

On-the-fly Parameter Estimation

Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values

Start executing task Update parameter estimates

during execution Switch plan if updated statistics

indicate so

ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)

Correct (but unknown) distribution

Initial default estimationUpdated estimationUpdated estimation

27

Outline

Description and analysis of crawl- and query-based plans

Optimization strategy

Experimental results and conclusions

28

Correctness of Theoretical Analysis

Solid lines Actual time Dotted lines Predicted time with correct parameters

Task Disease Outbreaks

Snowball IE system

182531 documents from NYT

16921 tokens

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100Recall

Exe

cutio

n T

ime

(se

cs)

Scan

Filt Scan

Automatic Query Gen

Iterative Set Expansion

29

Experimental Results (Information Extraction)

Solid lines Actual time Green line Time with optimizer

(results similar in other experiments ndash see paper)

100

1000

10000

100000

000 010 020 030 040 050 060 070 080 090 100

Recall

Exe

cutio

n T

ime

(sec

s)

Scan

Filt Scan

Iterative Set Expansion

Automatic Query Gen

OPTIMIZED

30

Conclusions

Common execution plans for multiple text-centric tasks

Analytic models for predicting execution time and recall of various crawl- and query-based plans

Techniques for on-the-fly parameter estimation

Optimization framework picks on-the-fly the fastest plan for target recall

31

Future Work

Incorporate precision and recall of extraction system in framework

Create non-parametric optimization (ie no assumption about distribution families)

Examine other text-centric tasks and analyze new execution plans

Create adaptive ldquonext-Krdquo optimizer

32

Thank you

Task Filtered Scan Iterative Set Expansion

Automatic Query Generation

Information Extraction

Grishman et al Jof Biomed Inf 2002

Agichtein and Gravano ICDE 2003

Agichtein and Gravano ICDE 2003

Content Summary Construction

- Callan et al SIGMOD 1999

Ipeirotis and Gravano VLDB 2002

Focused Resource Discovery

Chakrabarti et al WWW 1999

- Cohen and Singer AAAI WIBIS 1996

33

Overflow Slides

34

Experimental Results (IE Headquarters)

Task Company Headquarters

Snowball IE system

182531 documents from NYT

16921 tokens

35

Experimental Results (Content Summaries)

Content Summary Extraction

19997 documents from 20newsgroups

120024 tokens

36

Experimental Results (Content Summaries)

Content Summary Extraction

19,997 documents from 20newsgroups
120,024 tokens

ISE is a cheap plan for low target recall, but becomes the most expensive plan for high target recall.

37

Experimental Results (Content Summaries)

Content Summary Extraction

19,997 documents from 20newsgroups
120,024 tokens

The optimizer underestimated the recall of AQG and switched to ISE.

38

Experimental Results (Information Extraction)

[Figure: Execution time (secs; log scale, 100–100,000) vs. recall (0.00–1.00) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED.]

OPTIMIZED is faster than the "best" static plan: it overestimated the recall of Filtered Scan, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.

39

Focused Resource Discovery
800,000 web pages
12,000 tokens
