Harvesting Relational Tables from Lists on the Web


Hazem Elmeleegy, Purdue University

Jayant Madhavan and Alon Halevy, Google Inc.

Outline

Introduction

The ListExtract Approach

Experiments

Conclusion

Lists on the Web

• Our Goal: Extract tabular data from all such lists in an unsupervised and domain-independent manner.

• Not the typical wrapper generation problem.

Cartoons Example

A period (“.”) is used both as a delimiter and to terminate abbreviations

A slash (“/”) is used both as a delimiter and as part of the text

The slash (“/”) delimiter is missing (along with the prod. year)

• Easy for Humans

• Confusing for Machines

Key Contributions

Developed the ListExtract System, which extracts tables from lists in an unsupervised and domain-independent manner

Introduced the use of external sources of information, such as a large collection of tables collected from the Web and a language model, to help in the splitting decisions

Conducted a large-scale experimental study which suggests that tens of millions of high-quality lists can be exploited on the Web.


ListExtract Approach

Independent Splitting Phase
• Splitting Lines into Records

Alignment Phase
• Deciding the Number of Columns
• Re-Splitting Long Records
• Aligning Short Records (Null Insertion)

Refinement Phase
• Detecting Inconsistent Fields
• Re-Splitting Detected Field Streaks
• Re-Aligning Detected Field Streaks (Null Insertion)

Intermediate Outputs (Independent Splitting Phase)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

3 || The Band Concert || Disney || 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay

…

17 || Popeye the Sailor || Meets || Sinbad the Sailor || Fletcher || 1936

Intermediate Outputs (Re-Splitting Long Records)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

3 || The Band Concert || Disney || 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay

…

17. Popeye the Sailor Meets || Sinbad the Sailor || Fletcher || 1936

Number of Columns = 4

Intermediate Outputs (Alignment Phase)

1 What’s Opera Doc Warner Bros 1957

2 Duck Amuck Warner Bros 1953

3 The Band Concert Disney 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros 1953

5 One Froggy Evening Warner Bros 1956

6 Gertie the Dinosaur McCay

… … … …

17. Popeye the Sailor Meets Sinbad the Sailor Fletcher 1936

Final Output (Refinement Phase)

1 What’s Opera Doc Warner Bros 1957

2 Duck Amuck Warner Bros 1953


4 Duck Dodgers in the 24 1/2th Century Warner Bros 1953

5 One Froggy Evening Warner Bros 1956

6 Gertie the Dinosaur McCay

… … … …

17 Popeye the Sailor Meets Sinbad the Sailor (Fletcher) 1936





Line Splitting Algorithm

Input: 3. The Band Concert (Disney /1935)

pre-processing: (removing delimiters)

Subsequence || FQ Score
The Band Concert || 0.92
The Band || 0.89
Disney || 0.82
Band Concert || 0.65
Disney 1935 || 0.51
1935 || 0.34
... || …
3 || 0.15
Band Concert Disney || 0.12
3 The Band || 0.07
Concert Disney 1935 || 0.03
… || 0.01

[Figure: four candidate subsequences are check-marked (√) as selected fields. Recovered output fragments: "The Band Concert", "The Band Concert Disney".]
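One plausible reading of the line-splitting example above is a greedy procedure: take the candidate subsequence with the highest FQ score as a field, then split what remains to its left and right in the same way. The sketch below illustrates that reading only. The function names `split_line` and `slide_fq` are hypothetical, the score table is copied from the example, and scoring unseen subsequences as 0.0 is an assumption.

```python
from typing import Callable, List

def split_line(tokens: List[str], fq: Callable[[str], float]) -> List[str]:
    """Greedily split tokens into fields, best-scoring span first."""
    if not tokens:
        return []
    best_span, best_score = (0, len(tokens)), float("-inf")
    # Score every contiguous subsequence of the remaining tokens.
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            score = fq(" ".join(tokens[start:end]))
            if score > best_score:
                best_span, best_score = (start, end), score
    start, end = best_span
    # Keep the winner as one field and recurse on both remainders.
    return (split_line(tokens[:start], fq)
            + [" ".join(tokens[start:end])]
            + split_line(tokens[end:], fq))

# FQ scores taken from the example; anything else scores 0.0 (assumption).
SLIDE_SCORES = {
    "The Band Concert": 0.92, "The Band": 0.89, "Disney": 0.82,
    "Band Concert": 0.65, "Disney 1935": 0.51, "1935": 0.34, "3": 0.15,
    "Band Concert Disney": 0.12, "3 The Band": 0.07,
    "Concert Disney 1935": 0.03,
}

def slide_fq(candidate: str) -> float:
    return SLIDE_SCORES.get(candidate, 0.0)

fields = split_line("3 The Band Concert Disney 1935".split(), slide_fq)
print(fields)  # → ['3', 'The Band Concert', 'Disney', '1935']
```

With these scores the sketch selects four fields, which agrees with the four check marks in the example.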

Field Quality (FQ) Score

Linear combination of multiple score components
Each component corresponds to a source of evidence

Score Components

1. Data Type
Regular expressions to capture different data types (e.g., dates, emails, currencies, etc.)
Score: 1 if match found, 0 otherwise

2. Table Corpus
Check if the candidate sequence exists as a field in the table corpus
Score: 1 if it exists, 0 otherwise

3. Language Model
Measure the likelihood that the candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text.
Score: a combination of the probabilities capturing both the likelihood and the unlikelihood
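The three components above can be combined as in the following minimal sketch. It is an illustration under stated assumptions, not the system's implementation: the regular expressions, the weights, the `corpus_fields` set, and the `lm_score` callback (standing in for the language-model component) are all hypothetical.

```python
import re
from typing import Callable, Set, Tuple

# Hypothetical patterns for a few data types (assumption).
TYPE_PATTERNS = [
    re.compile(r"\d{4}"),                        # four-digit year
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),   # email address
    re.compile(r"[$€£]\s?\d+(\.\d{2})?"),        # currency amount
]

def type_score(candidate: str) -> float:
    """1 if the whole candidate matches a known data type, else 0."""
    return 1.0 if any(p.fullmatch(candidate) for p in TYPE_PATTERNS) else 0.0

def corpus_score(candidate: str, corpus_fields: Set[str]) -> float:
    """1 if the candidate appears as a field in the table corpus, else 0."""
    return 1.0 if candidate.lower() in corpus_fields else 0.0

def fq_score(candidate: str,
             corpus_fields: Set[str],
             lm_score: Callable[[str], float],
             weights: Tuple[float, float, float] = (0.3, 0.4, 0.3)) -> float:
    """Linear combination of the three evidence sources (weights assumed)."""
    w_type, w_corpus, w_lm = weights
    return (w_type * type_score(candidate)
            + w_corpus * corpus_score(candidate, corpus_fields)
            + w_lm * lm_score(candidate))

# Toy usage: a tiny corpus and a constant language-model score.
corpus = {"disney", "warner bros"}
constant_lm = lambda text: 0.5
print(fq_score("Disney", corpus, constant_lm) > fq_score("Concert Disney", corpus, constant_lm))
```

A real language-model component would return different values for different candidates; the constant here only keeps the example self-contained.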




Decide on the Number of Columns

Majority Voting across all records
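Majority voting over the record lengths can be sketched in a few lines. The function name `decide_num_columns` is hypothetical, and the tie-breaking rule (the count seen first wins) is an assumption that the slides do not address. The records are the independently split cartoon records shown earlier.

```python
from collections import Counter
from typing import List

def decide_num_columns(records: List[List[str]]) -> int:
    """Return the most common number of fields across all records."""
    if not records:
        raise ValueError("need at least one record")
    counts = Counter(len(record) for record in records)
    # most_common breaks ties by first occurrence (an assumption here).
    return counts.most_common(1)[0][0]

records = [
    ["1", "What’s Opera Doc", "Warner Bros", "1957"],
    ["2", "Duck Amuck", "Warner Bros", "1953"],
    ["3", "The Band Concert", "Disney", "1935"],
    ["4. Duck Dodgers in the 24 1/2th Century (Warner Bros", "1953"],
    ["5", "One Froggy Evening", "Warner Bros", "1956"],
    ["6", "Gertie the Dinosaur", "McCay"],
    ["17", "Popeye the Sailor", "Meets", "Sinbad the Sailor", "Fletcher", "1936"],
]
print(decide_num_columns(records))  # → 4
```

Records with more fields than the vote are the "long" records to re-split, and records with fewer are the "short" records to align.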





Re-Splitting Long Records

Input: 3. The Band Concert (Disney /1935)

pre-processing: (removing delimiters)

Maximum Number of Output Fields = 3

Subsequence || FQ Score
The Band Concert || 0.92
The Band || 0.89
Disney || 0.82
Band Concert || 0.65
Disney 1935 || 0.51
1935 || 0.34
... || …
3 || 0.15
Band Concert Disney || 0.12
3 The Band || 0.07
Concert Disney 1935 || 0.03
… || 0.01

[Figure: selected subsequences are check-marked (√). Recovered output fragments: "The Band Concert", "3 The Band Concert Disney / 1935".]
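The field limit can be illustrated with a brute-force stand-in: enumerate every way to cut the token sequence into at most the allowed number of fields and keep the split with the best average FQ score. This is not claimed to be the procedure ListExtract uses; it only shows the effect of the constraint. The names `resplit_long_record` and `slide_fq` are hypothetical, and scoring unseen subsequences as 0.0 is an assumption.

```python
from itertools import combinations
from typing import Callable, List

def resplit_long_record(tokens: List[str],
                        fq: Callable[[str], float],
                        max_fields: int) -> List[str]:
    """Best split (by average FQ score) using at most max_fields fields."""
    n = len(tokens)
    whole = " ".join(tokens)
    best_fields, best_avg = [whole], fq(whole)
    for num_fields in range(2, min(max_fields, n) + 1):
        # Choose num_fields - 1 cut positions between tokens.
        for cuts in combinations(range(1, n), num_fields - 1):
            bounds = [0, *cuts, n]
            fields = [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]
            avg = sum(fq(f) for f in fields) / len(fields)
            if avg > best_avg:
                best_fields, best_avg = fields, avg
    return best_fields

# FQ scores taken from the example; anything else scores 0.0 (assumption).
SLIDE_SCORES = {
    "The Band Concert": 0.92, "The Band": 0.89, "Disney": 0.82,
    "Band Concert": 0.65, "Disney 1935": 0.51, "1935": 0.34, "3": 0.15,
    "Band Concert Disney": 0.12, "3 The Band": 0.07,
    "Concert Disney 1935": 0.03,
}

def slide_fq(candidate: str) -> float:
    return SLIDE_SCORES.get(candidate, 0.0)

tokens = "3 The Band Concert Disney 1935".split()
print(resplit_long_record(tokens, slide_fq, max_fields=3))
# → ['3', 'The Band Concert', 'Disney 1935']
```

With a limit of three fields, the studio and year stay together in one field, which matches the recovered output fragment "Disney / 1935".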





Aligning Short Records (Null Insertion)

1. Sorting

2. Iterative Alignment

[Figure: the independently split records (average FQ scores 0.88, 0.79, 0.49, 0.62, 0.73, 0.92, 0.86) are sorted (0.92, 0.86, 0.79, 0.62, then 0.88, 0.73, 0.49) and aligned one by one into the output table, with NULL filling the missing fields of short records.]



To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm.

[NW, J. of Molecular Biology, 1970]

The two sequences:
Sequence #1: Table columns
Sequence #2: Fields of a short record

Design a Field-to-Field Consistency (F2FC) Score.

Use the average F2FC Score as the similarity measure for the alignment algorithm.
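The alignment described above can be sketched as a restricted form of Needleman-Wunsch in which gaps (NULLs) are only inserted into the short record, since every field must land in some column. This is a simplified illustration, not the system's code. The names `column_similarity` and `align_short_record`, the `gap_penalty` parameter, and the toy `f2fc` function in the usage lines are assumptions.

```python
from typing import Callable, List, Optional, Sequence

def column_similarity(field: str, column: Sequence[Optional[str]],
                      f2fc: Callable[[str, str], float]) -> float:
    """Average F2FC score between a field and the non-null column values."""
    values = [v for v in column if v is not None]
    if not values:
        return 0.0
    return sum(f2fc(field, v) for v in values) / len(values)

def align_short_record(columns: Sequence[Sequence[Optional[str]]],
                       fields: Sequence[str],
                       f2fc: Callable[[str, str], float],
                       gap_penalty: float = 0.0) -> List[Optional[str]]:
    n, m = len(columns), len(fields)
    if m > n:
        raise ValueError("record has more fields than the table has columns")
    sim = [[column_similarity(f, col, f2fc) for f in fields] for col in columns]
    neg = float("-inf")
    # best[i][j]: best score aligning the first i columns with the first j fields.
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap_penalty
        for j in range(1, min(i, m) + 1):
            match = best[i - 1][j - 1] + sim[i - 1][j - 1]
            skip = best[i - 1][j] - gap_penalty  # column i gets a NULL
            best[i][j] = max(match, skip)
    # Trace back from the full alignment to recover field placement.
    aligned: List[Optional[str]] = [None] * n
    i, j = n, m
    while i > 0:
        if j > 0 and best[i][j] == best[i - 1][j - 1] + sim[i - 1][j - 1]:
            aligned[i - 1] = fields[j - 1]
            j -= 1
        i -= 1
    return aligned

# Toy F2FC: two fields are consistent if both are numeric or both are not.
toy_f2fc = lambda a, b: 1.0 if a.isdigit() == b.isdigit() else 0.0
table_columns = [
    ["1", "2", "3"],
    ["What’s Opera Doc", "Duck Amuck", "The Band Concert"],
    ["Warner Bros", "Warner Bros", "Disney"],
    ["1957", "1953", "1935"],
]
print(align_short_record(table_columns, ["6", "Gertie the Dinosaur", "McCay"], toy_f2fc))
# → ['6', 'Gertie the Dinosaur', 'McCay', None]
```

Under this toy similarity the NULL lands in the year column because "McCay" is not numeric; a richer F2FC score could decide differently.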

Field-to-Field Consistency (F2FC) Score

Linear combination of multiple score components
Each component corresponds to a source of evidence

1. Data Type
Check if data types are consistent

2. Table Corpus
Check if two fields co-occur in the same column in a table in the corpus

3. Syntax
Measure the consistency of the syntax of the two fields
(e.g., length, % of upper/lower case letters, digits, spaces, etc.)

4. Delimiters
Measure the consistency between the delimiters on both sides of the two fields
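The four components above could be combined along the following lines. Everything specific in this sketch is an assumption made for illustration: the equal weights, the three coarse data types, the syntax features, the representation of delimiters as a (left, right) pair, and the `corpus_columns` structure.

```python
import re
from typing import List, Sequence, Set, Tuple

def data_type(field: str) -> str:
    """Coarse data type of a field (hypothetical categories)."""
    if re.fullmatch(r"\d{4}", field):
        return "year"
    if re.fullmatch(r"\d+([.,]\d+)?", field):
        return "number"
    return "text"

def syntax_features(field: str) -> List[float]:
    """Fractions of upper-case, lower-case, digit, and space characters."""
    n = max(len(field), 1)
    tests = (str.isupper, str.islower, str.isdigit, str.isspace)
    return [sum(test(c) for c in field) / n for test in tests]

def syntax_consistency(a: str, b: str) -> float:
    fa, fb = syntax_features(a), syntax_features(b)
    ratio_gap = sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)
    length_sim = min(len(a), len(b)) / max(len(a), len(b), 1)
    return 0.5 * (1.0 - ratio_gap) + 0.5 * length_sim

def f2fc(a: str, b: str,
         delims_a: Tuple[str, str], delims_b: Tuple[str, str],
         corpus_columns: Sequence[Set[str]],
         weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Linear combination of the four consistency components."""
    type_c = 1.0 if data_type(a) == data_type(b) else 0.0
    corpus_c = 1.0 if any(a in col and b in col for col in corpus_columns) else 0.0
    syntax_c = syntax_consistency(a, b)
    # Half credit for each side (left, right) whose delimiter agrees.
    delim_c = sum(x == y for x, y in zip(delims_a, delims_b)) / 2
    w_type, w_corpus, w_syntax, w_delim = weights
    return w_type * type_c + w_corpus * corpus_c + w_syntax * syntax_c + w_delim * delim_c

# Toy usage with a two-column corpus.
corpus_columns = [{"1957", "1953", "1935"}, {"Disney", "Warner Bros"}]
same_column = f2fc("1957", "1953", ("/", ")"), ("/", ")"), corpus_columns)
cross_column = f2fc("1957", "Disney", ("/", ")"), ("(", "/"), corpus_columns)
print(same_column > cross_column)  # → True
```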





Refinement Phase

[Figure: the output table of the alignment phase is refined step by step. Inconsistent fields are marked X; isolated marks are dropped and only runs of adjacent inconsistent fields within a row are kept as streaks.]

1. Detect Inconsistent Fields

2. Consider streaks only

3. Re-merge

4. Re-split (and re-align if needed)
Use extended FQ score
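The "consider streaks only" and "re-merge" steps can be sketched as follows. Treating a streak as a run of at least two adjacent inconsistent fields in a row is an assumption based on the refinement figure, where isolated marks are dropped. The names `find_streaks` and `remerge` are hypothetical, and the six-field record is the unrefined Popeye record from the earlier example.

```python
from typing import List, Optional, Tuple

def find_streaks(inconsistent: List[bool], min_length: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for runs of flagged fields."""
    streaks, start = [], None
    # A trailing False acts as a sentinel that closes a run at the row's end.
    for i, flag in enumerate(list(inconsistent) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                streaks.append((start, i))
            start = None
    return streaks

def remerge(row: List[Optional[str]], streak: Tuple[int, int]) -> List[Optional[str]]:
    """Join the fields of a streak back into one field, skipping NULLs."""
    start, end = streak
    merged = " ".join(f for f in row[start:end] if f is not None)
    return row[:start] + [merged] + row[end:]

print(find_streaks([False, False, True, True, True, False]))  # → [(2, 5)]
print(find_streaks([False, True, False, False, False, False]))  # → []

row = ["17", "Popeye the Sailor", "Meets", "Sinbad the Sailor", "Fletcher", "1936"]
print(remerge(row, (1, 4)))
# → ['17', 'Popeye the Sailor Meets Sinbad the Sailor', 'Fletcher', '1936']
```

The merged text would then be re-split under the extended FQ score, and re-aligned if the result has fewer fields than the columns the streak spanned.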

Field Quality (FQ) Score [Revisited]

Linear combination of multiple score components
Each component corresponds to a source of evidence

1. Data Type

2. Table Corpus

3. Language Model

4. List Support: favors candidates which are more consistent with the columns spanned by the streak





Table Extraction (TE) Score

Average FQ Score for all fields in the extracted table

Used to compare and rank the extracted tables by their extraction quality
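As a small illustration, the TE score and the resulting ranking might look like this. The names `te_score` and `rank_tables` are hypothetical, the `fq` callback stands in for the FQ score, and skipping NULL fields when averaging is an assumption.

```python
from typing import Callable, List, Optional

Table = List[List[Optional[str]]]

def te_score(table: Table, fq: Callable[[str], float]) -> float:
    """Average FQ score over all non-null fields of an extracted table."""
    scores = [fq(field) for row in table for field in row if field is not None]
    return sum(scores) / len(scores) if scores else 0.0

def rank_tables(tables: List[Table], fq: Callable[[str], float]) -> List[Table]:
    """Order tables from highest to lowest TE score."""
    return sorted(tables, key=lambda table: te_score(table, fq), reverse=True)

# Toy FQ score: numeric fields score 1.0, everything else 0.5.
toy_fq = lambda field: 1.0 if field.isdigit() else 0.5
table_a = [["1", "x"], ["2", "y"]]
table_b = [["a", "b"]]
print(te_score(table_a, toy_fq))  # → 0.75
print(rank_tables([table_b, table_a], toy_fq)[0] is table_a)  # → True
```

Keeping only tables above a chosen TE threshold is one way to trade the number of extracted tables against their quality.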


Overall Performance for WLists and TDLists

[Figure: F-measure (y-axis, 0.5 to 1) versus top percentage of extracted tables (x-axis, 20 to 100), with one curve for WLists and one for TDLists.]

WLists: A set of 20 manually-collected HTML lists spanning 20 different domains.

TDLists: A set of 100 lists derived from randomly-selected HTML tables

Effect of the Refinement Phase (WLists)

[Figure: F-measure (y-axis, 0.5 to 1) versus top percentage of extracted tables (x-axis, 20 to 100), with one curve for Refinement and one for No Refinement.]

Large-Scale Experiment

A crawl of 100K web pages

100K extracted lists

32K lists after filtering

11K extracted tables with multiple columns

[Figure: number of extracted tables (y-axis, 0 to 11,000) versus table extraction score (x-axis, 0 to 1), with marked points (0.65, ~1,000 tables) and (0.45, ~10,300 tables).]


Conclusion

Our work is a continuation of the efforts to extract structured data from the Web.

Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions.

Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.

Thank you

Questions?
