Harvesting Relational Tables from Lists on the Web
Hazem Elmeleegy (Purdue University), Jayant Madhavan and Alon Halevy (Google Inc.)

Page 1: Harvesting Relational Tables from Lists on the Web

Harvesting Relational Tables from Lists on the Web

Hazem Elmeleegy, Purdue University

Jayant Madhavan and Alon Halevy, Google Inc.

Page 2: Harvesting Relational Tables from Lists on the Web

Outline

Introduction

The ListExtract Approach

Experiments

Conclusion

Page 3: Harvesting Relational Tables from Lists on the Web

Lists on the Web

Page 4: Harvesting Relational Tables from Lists on the Web

Lists on the Web

Page 5: Harvesting Relational Tables from Lists on the Web

Lists on the Web

Page 6: Harvesting Relational Tables from Lists on the Web

Lists on the Web

• Our Goal: Extract tabular data from all such lists in an unsupervised and domain-independent manner.

• Not the typical wrapper generation problem.

Page 7: Harvesting Relational Tables from Lists on the Web

Cartoons Example

A period (“.”) is used both as a delimiter and to terminate abbreviations

A slash (“/”) is used both as a delimiter and as part of the text

The slash (“/”) delimiter is missing (along with the production year)

• Easy for Humans

• Confusing for Machines

Page 8: Harvesting Relational Tables from Lists on the Web

Key Contributions

Developed the ListExtract system, which extracts tables from lists in an unsupervised and domain-independent manner.

Introduced the use of external sources of information, such as a large collection of tables gathered from the Web and a language model, to help make the splitting decisions.

Conducted a large-scale experimental study, which suggests that tens of millions of high-quality lists can be exploited on the Web.

Page 9: Harvesting Relational Tables from Lists on the Web

Outline

Introduction

The ListExtract Approach

Experiments

Conclusion

Page 10: Harvesting Relational Tables from Lists on the Web

ListExtract Approach

Independent Splitting Phase

  • Splitting Lines into Records

Alignment Phase

  • Deciding the Number of Columns

  • Re-Splitting Long Records

  • Aligning Short Records (Null Insertion)

Refinement Phase

  • Detecting Inconsistent Fields

  • Re-Splitting Detected Field Streaks

  • Re-Aligning Detected Field Streaks (Null Insertion)

Page 11: Harvesting Relational Tables from Lists on the Web

Intermediate Outputs (Independent Splitting Phase)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

3 || The Band Concert || Disney || 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay

17 || Popeye the Sailor || Meets || Sinbad the Sailor || Fletcher || 1936

Page 12: Harvesting Relational Tables from Lists on the Web

Intermediate Outputs (Re-Splitting Long Records)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

3 || The Band Concert || Disney || 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay

17. Popeye the Sailor Meets || Sinbad the Sailor || Fletcher || 1936

Number of Columns = 4

Page 13: Harvesting Relational Tables from Lists on the Web

Intermediate Outputs (Alignment Phase)

1 What’s Opera Doc Warner Bros 1957

2 Duck Amuck Warner Bros 1953

3 The Band Concert Disney 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros 1953

5 One Froggy Evening Warner Bros 1956

6 Gertie the Dinosaur McCay

… … … …

17. Popeye the Sailor Meets Sinbad the Sailor Fletcher 1936

Page 14: Harvesting Relational Tables from Lists on the Web

Final Output (Refinement Phase)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

3 || The Band Concert || Disney || 1935

4 || Duck Dodgers in the 24 1/2th Century || Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay || NULL

… || … || … || …

17 || Popeye the Sailor Meets Sinbad the Sailor || (Fletcher) || 1936

Page 15: Harvesting Relational Tables from Lists on the Web

ListExtract Approach

[Pipeline overview diagram, repeated]

Page 16: Harvesting Relational Tables from Lists on the Web

Line Splitting Algorithm

Input: 3. The Band Concert (Disney /1935)

Pre-processing (removing delimiters): 3 The Band Concert Disney 1935

Subsequence || FQ Score

The Band Concert || 0.92
The Band || 0.89
Disney || 0.82
Band Concert || 0.65
Disney 1935 || 0.51
1935 || 0.34
... || …
3 || 0.15
Band Concert Disney || 0.12
3 The Band || 0.07
Concert Disney 1935 || 0.03
3 The Band Concert Disney 1935 || 0.01

Selection steps: The Band Concert → The Band Concert, Disney → 3, The Band Concert, Disney, 1935

Output: 3 || The Band Concert || Disney || 1935
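The selection steps on this slide suggest a greedy procedure: accept the highest-scoring subsequence as a field, then keep accepting the next best subsequence that does not overlap a field already chosen. The Python sketch below works under that assumption. The names `split_line`, `toy_fq`, and `SLIDE_SCORES`, the optional `max_fields` cap, and the default score of 0.0 for subsequences the slide elides are all illustrative choices, not ListExtract's actual implementation.

```python
from typing import Callable, List, Optional

def split_line(tokens: List[str], fq: Callable[[str], float],
               max_fields: Optional[int] = None) -> List[str]:
    """Greedy split: repeatedly accept the best-scoring subsequence that
    does not overlap an already accepted field (hypothetical helper)."""
    n = len(tokens)
    covered = [False] * n
    chosen = []  # accepted (start, end) spans, end exclusive

    def runs_left(start: int, end: int) -> int:
        # Number of uncovered runs that would remain after taking [start, end).
        runs, prev_free = 0, False
        for k in range(n):
            free = not covered[k] and not (start <= k < end)
            if free and not prev_free:
                runs += 1
            prev_free = free
        return runs

    # Every contiguous subsequence, highest FQ score first.
    spans = sorted(
        ((i, j) for i in range(n) for j in range(i + 1, n + 1)),
        key=lambda s: fq(" ".join(tokens[s[0]:s[1]])),
        reverse=True,
    )
    for start, end in spans:
        if any(covered[start:end]):
            continue  # overlaps a field that was already accepted
        # Each remaining run needs at least one more field, so skip spans
        # that would push the total past the cap (assumed cap >= 1).
        if (max_fields is not None
                and len(chosen) + 1 + runs_left(start, end) > max_fields):
            continue
        chosen.append((start, end))
        covered[start:end] = [True] * (end - start)
    return [" ".join(tokens[s:e]) for s, e in sorted(chosen)]

# FQ scores copied from the slide; unlisted subsequences default to 0.0 (assumption).
SLIDE_SCORES = {
    "The Band Concert": 0.92, "The Band": 0.89, "Disney": 0.82,
    "Band Concert": 0.65, "Disney 1935": 0.51, "1935": 0.34, "3": 0.15,
    "Band Concert Disney": 0.12, "3 The Band": 0.07,
    "Concert Disney 1935": 0.03, "3 The Band Concert Disney 1935": 0.01,
}

def toy_fq(subsequence: str) -> float:
    return SLIDE_SCORES.get(subsequence, 0.0)

tokens = "3 The Band Concert Disney 1935".split()
print(split_line(tokens, toy_fq))
# ['3', 'The Band Concert', 'Disney', '1935']
print(split_line(tokens, toy_fq, max_fields=3))
# ['3', 'The Band Concert', 'Disney 1935']
```

With no cap the sketch reproduces the four fields shown for record 3. With `max_fields=3`, "Disney" is skipped because accepting it would force a fourth field, so "Disney 1935" is kept whole.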

Page 17: Harvesting Relational Tables from Lists on the Web

Field Quality (FQ) Score

Linear combination of multiple score components. Each component corresponds to a source of evidence.

Score Components

1. Data Type

Regular expressions to capture different data types (e.g. dates, emails, currencies, etc.)

Score: 1 if a match is found, 0 otherwise

2. Table Corpus

Check if the candidate sequence exists as a field in the table corpus

Score: 1 if it exists, 0 otherwise

3. Language Model

Measure the likelihood that the candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text.

Score: a combination of the probabilities capturing both the likelihood and unlikelihood
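As a rough illustration of the linear combination, the Python sketch below wires the three components together. The slides give neither the regular expressions, the weights, nor the exact language-model formula, so the patterns, the equal default weights, and the `lm_score` callback (assumed to return a value between 0 and 1) are placeholders rather than the system's real settings.

```python
import re
from typing import Callable, Set, Tuple

# Illustrative patterns only; the slides do not list the actual expressions.
TYPE_PATTERNS = [
    re.compile(r"\d{4}"),                       # four-digit year
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),  # email address
    re.compile(r"\$\d+(\.\d{2})?"),             # dollar amount
]

def type_score(candidate: str) -> float:
    """1 if the whole candidate matches any data-type pattern, else 0."""
    return 1.0 if any(p.fullmatch(candidate) for p in TYPE_PATTERNS) else 0.0

def corpus_score(candidate: str, corpus_fields: Set[str]) -> float:
    """1 if the candidate appears as a field in the (lower-cased) corpus, else 0."""
    return 1.0 if candidate.lower() in corpus_fields else 0.0

def fq_score(candidate: str,
             corpus_fields: Set[str],
             lm_score: Callable[[str], float],
             weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three evidence components (weights are assumed)."""
    w_type, w_corpus, w_lm = weights
    return (w_type * type_score(candidate)
            + w_corpus * corpus_score(candidate, corpus_fields)
            + w_lm * lm_score(candidate))

corpus = {"disney", "warner bros"}   # toy stand-in for the web table corpus
flat_lm = lambda s: 0.5              # toy stand-in for the language model
print(fq_score("1935", corpus, flat_lm, weights=(0.5, 0.25, 0.25)))    # 0.625
print(fq_score("Disney", corpus, flat_lm, weights=(0.5, 0.25, 0.25)))  # 0.375
```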

Page 18: Harvesting Relational Tables from Lists on the Web

ListExtract Approach

[Pipeline overview diagram, repeated]

Decide on the Number of Columns: Majority Voting across all records
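Majority voting over field counts can be sketched in a few lines of Python. The function name `decide_num_columns` is illustrative, and the tie-breaking rule (the count seen first wins) is simply what `Counter.most_common` does, not something the slides specify.

```python
from collections import Counter
from typing import List

def decide_num_columns(records: List[List[str]]) -> int:
    """Return the most common number of fields across all split records."""
    counts = Counter(len(record) for record in records)
    return counts.most_common(1)[0][0]

# Records from the Independent Splitting Phase slide.
records = [
    ["1", "What’s Opera Doc", "Warner Bros", "1957"],
    ["2", "Duck Amuck", "Warner Bros", "1953"],
    ["3", "The Band Concert", "Disney", "1935"],
    ["4. Duck Dodgers in the 24 1/2th Century (Warner Bros", "1953"],
    ["5", "One Froggy Evening", "Warner Bros", "1956"],
    ["6", "Gertie the Dinosaur", "McCay"],
    ["17", "Popeye the Sailor", "Meets", "Sinbad the Sailor", "Fletcher", "1936"],
]
print(decide_num_columns(records))  # 4
```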

Page 19: Harvesting Relational Tables from Lists on the Web

ListExtract Approach

[Pipeline overview diagram, repeated]

Page 20: Harvesting Relational Tables from Lists on the Web

Re-Splitting Long Records

Input: 3. The Band Concert (Disney /1935)

Pre-processing (removing delimiters): 3 The Band Concert Disney 1935

Maximum Number of Output Fields = 3

Subsequence || FQ Score

The Band Concert || 0.92
The Band || 0.89
Disney || 0.82
Band Concert || 0.65
Disney 1935 || 0.51
1935 || 0.34
... || …
3 || 0.15
Band Concert Disney || 0.12
3 The Band || 0.07
Concert Disney 1935 || 0.03
3 The Band Concert Disney 1935 || 0.01

Selection steps: The Band Concert → 3, The Band Concert, Disney / 1935

Output: 3 || The Band Concert || Disney / 1935

Page 21: Harvesting Relational Tables from Lists on the Web

ListExtract Approach

[Pipeline overview diagram, repeated]

Page 22: Harvesting Relational Tables from Lists on the Web

Aligning Short Records (Null Insertion)

Independently Split Records || Avg. FQ Score

... ... ... || 0.88
... ... ... ... || 0.79
... ... || 0.49
... ... ... ... || 0.62
... ... ... || 0.73
... ... ... ... || 0.92
... ... ... ... || 0.86

Page 23: Harvesting Relational Tables from Lists on the Web

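Aligning Short Records (Null Insertion)

1- Sorting: the independently split records are reordered so that the full-length records come first and the short records follow, each group by descending Avg. FQ Score (0.92, 0.86, 0.79, 0.62, then 0.88, 0.73, 0.49).

2- Iterative Alignment: each short record is aligned against the output table, and NULL is inserted for its missing fields.

[Figure: the sorted, independently split records on the left and the resulting output table with NULL cells on the right]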

Page 24: Harvesting Relational Tables from Lists on the Web

Aligning Short Records (Null Insertion)

To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm.

[NW, J. of Molecular Biology, 1970]

The two sequences:

Sequence #1: Table columns

Sequence #2: Fields of a short record

Design a Field-to-Field Consistency (F2FC) Score.

Use the average F2FC Score as the similarity measure for the alignment algorithm.
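The Python sketch below shows how such an alignment could insert NULLs. It is a restricted form of Needleman-Wunsch dynamic programming in which gaps are only ever inserted into the short record, since every field must land in some column. The function name `align_short_record`, the `sim(column_index, field)` callback standing in for the average F2FC score, the zero `null_penalty`, and the toy similarity are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

def align_short_record(fields: List[str], num_columns: int,
                       sim: Callable[[int, str], float],
                       null_penalty: float = 0.0) -> List[Optional[str]]:
    """Place fields into columns, in order, maximising total similarity.
    Columns left without a field receive None (NULL)."""
    m = len(fields)
    if m > num_columns:
        raise ValueError("record is longer than the table; re-split it first")
    neg = float("-inf")
    # best[i][j]: best score placing the first j fields in the first i columns.
    best = [[neg] * (m + 1) for _ in range(num_columns + 1)]
    took = [[False] * (m + 1) for _ in range(num_columns + 1)]
    best[0][0] = 0.0
    for i in range(1, num_columns + 1):
        for j in range(0, min(i, m) + 1):
            skip = best[i - 1][j] + null_penalty           # column i-1 gets NULL
            use = (best[i - 1][j - 1] + sim(i - 1, fields[j - 1])
                   if j > 0 else neg)                      # column i-1 gets field j-1
            if use > skip:
                best[i][j], took[i][j] = use, True
            else:
                best[i][j] = skip
    # Trace back from the last column to recover the placement.
    row: List[Optional[str]] = [None] * num_columns
    j = m
    for i in range(num_columns, 0, -1):
        if took[i][j]:
            row[i - 1] = fields[j - 1]
            j -= 1
    return row

def toy_sim(col: int, field: str) -> float:
    """Toy stand-in for average F2FC: rank, two text columns, four-digit year."""
    is_num = field.isdigit()
    if col == 0:
        return 1.0 if is_num and len(field) <= 2 else 0.0
    if col == 3:
        return 1.0 if is_num and len(field) == 4 else 0.0
    return 0.0 if is_num else 1.0

print(align_short_record(["6", "Gertie the Dinosaur", "McCay"], 4, toy_sim))
# ['6', 'Gertie the Dinosaur', 'McCay', None]
```

In the cartoons example this puts the NULL in the year column for record 6, which is the field the list itself omits.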

Page 25: Harvesting Relational Tables from Lists on the Web

Field-to-Field Consistency (F2FC) Score

Linear combination of multiple score components. Each component corresponds to a source of evidence.

Score Components

1. Data Type

Check if the data types are consistent

2. Table Corpus

Check if the two fields co-occur in the same column of a table in the corpus

3. Syntax

Measure the consistency of the syntax of the two fields (e.g. length, % of upper/lower case letters, digits, spaces, etc.)

4. Delimiters

Measure the consistency between the delimiters on both sides of the two fields
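To make the syntax component concrete, here is one possible way to compare two fields on the features the slide names. The slides do not give the formula, so the feature set, the length ratio, and the 50/50 weighting in this Python sketch are assumptions, and `syntax_consistency` is a hypothetical name.

```python
from typing import List

def syntax_features(field: str) -> List[float]:
    """Fractions of upper-case, lower-case, digit and space characters."""
    n = max(len(field), 1)  # avoid dividing by zero on empty fields
    return [
        sum(c.isupper() for c in field) / n,
        sum(c.islower() for c in field) / n,
        sum(c.isdigit() for c in field) / n,
        sum(c.isspace() for c in field) / n,
    ]

def syntax_consistency(a: str, b: str) -> float:
    """Score in [0, 1]; 1 means identical character mix and length."""
    fa, fb = syntax_features(a), syntax_features(b)
    mix_diff = sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)
    length_ratio = min(len(a), len(b)) / max(len(a), len(b), 1)
    # Equal weighting of character mix and length is an arbitrary choice.
    return 0.5 * (1.0 - mix_diff) + 0.5 * length_ratio

print(syntax_consistency("1957", "1953"))   # 1.0
print(syntax_consistency("1957", "Disney") < syntax_consistency("1957", "1953"))  # True
```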

Page 26: Harvesting Relational Tables from Lists on the Web

ListExtract Approach

[Pipeline overview diagram, repeated]

Page 27: Harvesting Relational Tables from Lists on the Web

Refinement Phase

… … … … … …

… … … … … …

… … … … … …

… … … … … …

… … … … … …

… … … … … …

Output Table

Page 28: Harvesting Relational Tables from Lists on the Web

Refinement Phase

… X … … … …

… … … … … X

… … X X X …

… … … … … …

… X X … … …

… … … … X …

Detect Inconsistent Fields

Output Table

Page 29: Harvesting Relational Tables from Lists on the Web

Refinement Phase

… … … … … …

… … … … … …

… … X X X …

… … … … … …

… X X … … …

… … … … … …

Detect Inconsistent Fields

Consider streaks only

Output Table

Page 30: Harvesting Relational Tables from Lists on the Web

Refinement Phase

… … … … … …

… … … … … …

… … … …

… … … … … …

… … … … …

… … … … … …

Detect Inconsistent Fields

Consider streaks only

Re-merge

Output Table

Page 31: Harvesting Relational Tables from Lists on the Web

Refinement Phase

… … … … … …

… … … … … …

… … √ √ √ …

… … … … … …

… √ √ … … …

… … … … … …

Detect Inconsistent Fields

Consider streaks only

Re-merge

Re-split (and re-align if needed)

Use extended FQ score

Output Table
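The "consider streaks only" step can be sketched in Python as a scan for runs of adjacent inconsistent fields within a row. The grids above keep only the rows with two or more adjacent X marks, so the sketch assumes a streak is a run of at least two flagged fields; `find_streaks` and `min_len` are illustrative names, and how fields get flagged in the first place is outside this sketch.

```python
from typing import List, Tuple

def find_streaks(flags: List[bool], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, for each run of
    adjacent inconsistent fields that is at least min_len long."""
    streaks = []
    start = None
    # A trailing False closes any run that reaches the end of the row.
    for i, flagged in enumerate(list(flags) + [False]):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_len:
                streaks.append((start, i))
            start = None
    return streaks

X, o = True, False
print(find_streaks([o, X, o, o, o, o]))  # []        isolated field, ignored
print(find_streaks([o, o, X, X, X, o]))  # [(2, 5)]  re-merge, then re-split
print(find_streaks([o, X, X, o, o, o]))  # [(1, 3)]
```

Each returned span marks the fields that would be re-merged into one string and then re-split (and re-aligned if needed).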

Page 32: Harvesting Relational Tables from Lists on the Web

Field Quality (FQ) Score [Revisited]

Linear combination of multiple score components. Each component corresponds to a source of evidence.

Score Components

1. Data Type

2. Table Corpus

3. Language Model

4. List Support: favors candidates which are more consistent with the columns spanned by the streak

Page 33: Harvesting Relational Tables from Lists on the Web

ListExtract Approach

[Pipeline overview diagram, repeated]

Page 34: Harvesting Relational Tables from Lists on the Web

Table Extraction (TE) Score

Average FQ Score for all fields in the extracted table

Used to compare and rank the extracted tables by their extraction quality
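A short Python sketch of the averaging and ranking follows. It assumes each table is supplied as rows of per-field FQ scores; whether NULL cells contribute to the average is not stated in the slides, so the sketch simply averages whatever scores it is given. `te_score` and `rank_tables` are illustrative names.

```python
from statistics import mean
from typing import Dict, List, Tuple

def te_score(field_scores: List[List[float]]) -> float:
    """Average FQ score over every field of one extracted table.
    Raises statistics.StatisticsError if the table has no fields."""
    return mean(score for row in field_scores for score in row)

def rank_tables(tables: Dict[str, List[List[float]]]) -> List[Tuple[str, float]]:
    """Sort tables by TE score, best extraction first."""
    scored = [(name, te_score(rows)) for name, rows in tables.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

tables = {
    "list_a": [[0.25]],
    "list_b": [[0.75, 0.25]],
}
print(rank_tables(tables))  # [('list_b', 0.5), ('list_a', 0.25)]
```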

Page 35: Harvesting Relational Tables from Lists on the Web

Outline

Introduction

The ListExtract Approach

Experiments

Conclusion

Page 36: Harvesting Relational Tables from Lists on the Web

Overall Performance for WLists and TDLists

[Figure: F-measure (y-axis, 0.5 to 1) against the top percentage of extracted tables (x-axis, 20 to 100), with one curve each for WLists and TDLists]

WLists: A set of 20 manually collected HTML lists spanning 20 different domains.

TDLists: A set of 100 lists derived from randomly selected HTML tables.

Page 37: Harvesting Relational Tables from Lists on the Web

Effect of the Refinement Phase (WLists)

[Figure: F-measure (y-axis, 0.5 to 1) against the top percentage of extracted tables (x-axis, 20 to 100), with one curve each for Refinement and No Refinement]

Page 38: Harvesting Relational Tables from Lists on the Web

Large-Scale Experiment

A crawl of 100K web pages

100K extracted lists

32K lists after filtering

11K extracted tables with multiple columns

[Figure: number of extracted tables (y-axis, 0 to 11,000) against Table Extraction Score (x-axis, 0 to 1), with marked points (0.65, ~1,000 tables) and (0.45, ~10,300 tables)]

Page 39: Harvesting Relational Tables from Lists on the Web

Outline

Introduction

The ListExtract Approach

Experiments

Conclusion

Page 40: Harvesting Relational Tables from Lists on the Web

Conclusion

Our work is a continuation of the efforts to extract structured data from the Web.

Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions.

Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.

Page 41: Harvesting Relational Tables from Lists on the Web

Thank you

Questions?