Harvesting Relational Tables from Lists on the Web


Hazem Elmeleegy, Purdue University

Jayant Madhavan and Alon Halevy, Google Inc.

Outline

Introduction

The ListExtract Approach

Experiments

Conclusion

Lists on the Web

• Our Goal: Extract tabular data from all such lists in an unsupervised and domain-independent manner.

• Not the typical wrapper generation problem.

Cartoons Example

A period (“.”) is used both as a delimiter and to terminate abbreviations

A slash (“/”) is used both as a delimiter and as part of the text

The slash (“/”) delimiter is missing (along with the prod.

• Easy for Humans

• Confusing for Machines

Key Contributions

Developed the ListExtract System, which extracts tables from lists in an unsupervised and domain-independent manner

Introduced using external sources of information such as a large collection of tables collected from the web and a language model to help in the splitting decisions

Conducted a large-scale experimental study which suggests that tens of millions of high-quality lists can be exploited on the Web.

Outline

Introduction

The ListExtract Approach

Experiments

Conclusion

ListExtract Approach

Independent Splitting Phase

• Splitting Lines into Records

• Deciding the Number of Columns

• Re-Splitting Long Records

Alignment Phase

• Aligning Short Records (Null Insertion)

Refinement Phase

• Detecting Inconsistent Fields

• Re-Splitting Detected Field Streaks

• Re-Aligning Detected Field Streaks (Null Insertion)

Intermediate Outputs (Independent Splitting Phase)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

3 || The Band Concert || Disney || 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay

17 || Popeye the Sailor || Meets || Sinbad the Sailor || Fletcher || 1936

Intermediate Outputs (Re-Splitting Long Records)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

3 || The Band Concert || Disney || 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay

17. Popeye the Sailor Meets || Sinbad the Sailor || Fletcher || 1936

Number of Columns = 4

Intermediate Outputs (Alignment Phase)

1 What’s Opera Doc Warner Bros 1957

2 Duck Amuck Warner Bros 1953

3 The Band Concert Disney 1935

4. Duck Dodgers in the 24 1/2th Century (Warner Bros 1953

5 One Froggy Evening Warner Bros 1956

6 Gertie the Dinosaur McCay

… … … …

17. Popeye the Sailor Meets Sinbad the Sailor Fletcher 1936

Final Output (Refinement Phase)

1 || What’s Opera Doc || Warner Bros || 1957

2 || Duck Amuck || Warner Bros || 1953

4 || Duck Dodgers in the 24 1/2th Century || Warner Bros || 1953

5 || One Froggy Evening || Warner Bros || 1956

6 || Gertie the Dinosaur || McCay || NULL

… || … || … || …

17 || Popeye the Sailor Meets Sinbad the Sailor || (Fletcher) || 1936


Line Splitting Algorithm

Input line: 3. The Band Concert (Disney /1935)

Pre-processing: removing delimiters

Candidate subsequences ranked by FQ score:

Subsequence || FQ

The Band Concert || 0.92

The Band || 0.89

Disney || 0.82

Band Concert || 0.65

Disney 1935 || 0.51

1935 || 0.34

... || …

3 || 0.15

Band Concert Disney || 0.12

3 The Band || 0.07

Concert Disney 1935 || 0.03

Fields shown in the figure: The Band Concert; The Band Concert Disney
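The ranked candidates above suggest one way a line could be split. The following is a minimal sketch, assuming a greedy strategy that picks the highest-scoring subsequence as a field and then recurses on the tokens to its left and right. The function names, the pre-processed token string, and the zero default for unlisted candidates are illustrative assumptions; only the scores come from the slide.

```python
from typing import Callable, List

def split_line(tokens: List[str], fq: Callable[[str], float]) -> List[str]:
    """Greedy split: take the best-scoring contiguous span, then recurse on both sides."""
    if not tokens:
        return []
    best_score, best_i, best_j = -1.0, 0, 1
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            score = fq(" ".join(tokens[i:j]))
            if score > best_score:
                best_score, best_i, best_j = score, i, j
    field = " ".join(tokens[best_i:best_j])
    return split_line(tokens[:best_i], fq) + [field] + split_line(tokens[best_j:], fq)

# FQ scores copied from the slide; any candidate not listed scores 0.0 (assumption).
SLIDE_SCORES = {
    "The Band Concert": 0.92, "The Band": 0.89, "Disney": 0.82,
    "Band Concert": 0.65, "Disney 1935": 0.51, "1935": 0.34, "3": 0.15,
    "Band Concert Disney": 0.12, "3 The Band": 0.07, "Concert Disney 1935": 0.03,
}

def toy_fq(candidate: str) -> float:
    return SLIDE_SCORES.get(candidate, 0.0)

# Assumed result of pre-processing "3. The Band Concert (Disney /1935)".
fields = split_line("3 The Band Concert Disney 1935".split(), toy_fq)
print(fields)
```

With these scores the sketch yields the same four fields as record 3 in the intermediate outputs.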

Field Quality (FQ) Score

Linear combination of multiple score components. Each component corresponds to a source of evidence.

Score Components

1. Data Type

Regular expressions to capture different data types (e.g., dates, emails, currencies, etc.)

Score: 1 if match found, 0 otherwise

2. Table Corpus

Check if the candidate sequence exists as a field in the table corpus

Score: 1 if exists, 0 otherwise

3. Language Model

Measure the likelihood that the candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text.

Score: a combination of the probabilities capturing both the likelihood and unlikelihood
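The three components can be combined as in the sketch below. It is only an illustration: the weights, the regular expressions, the lowercase corpus lookup, and the `lm_score` callable (a stand-in for the language-model component, expected to return a value in [0, 1]) are all assumptions, not values from the slides.

```python
import re
from typing import Callable, Set, Tuple

# Hypothetical data-type patterns; a real system would cover many more types.
DATA_TYPE_PATTERNS = [
    re.compile(r"\d{4}"),                     # four-digit year
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email address
    re.compile(r"\$\d+(\.\d{2})?"),           # dollar amount
]

def data_type_score(candidate: str) -> float:
    """1 if the whole candidate matches a known data type, 0 otherwise."""
    return 1.0 if any(p.fullmatch(candidate) for p in DATA_TYPE_PATTERNS) else 0.0

def table_corpus_score(candidate: str, corpus_fields: Set[str]) -> float:
    """1 if the candidate appears as a field in the table corpus, 0 otherwise."""
    return 1.0 if candidate.lower() in corpus_fields else 0.0

def fq_score(candidate: str,
             corpus_fields: Set[str],
             lm_score: Callable[[str], float],
             weights: Tuple[float, float, float] = (0.3, 0.4, 0.3)) -> float:
    """Linear combination of the three evidence sources (weights are assumed)."""
    w_type, w_corpus, w_lm = weights
    return (w_type * data_type_score(candidate)
            + w_corpus * table_corpus_score(candidate, corpus_fields)
            + w_lm * lm_score(candidate))

corpus = {"warner bros", "disney"}            # toy stand-in for the table corpus
year_fq = fq_score("1957", corpus, lambda s: 0.0)
studio_fq = fq_score("Warner Bros", corpus, lambda s: 0.5)
```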

Deciding the Number of Columns

Majority Voting across all records
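A minimal sketch of the vote, using the field counts of the independently split records shown earlier. The function name is hypothetical, and the tie-breaking rule (the first count encountered wins) is an assumption.

```python
from collections import Counter
from typing import List

def decide_num_columns(split_records: List[List[str]]) -> int:
    """Return the most common field count across all split records."""
    counts = Counter(len(record) for record in split_records)
    return counts.most_common(1)[0][0]

records = [
    ["1", "What’s Opera Doc", "Warner Bros", "1957"],
    ["2", "Duck Amuck", "Warner Bros", "1953"],
    ["3", "The Band Concert", "Disney", "1935"],
    ["4. Duck Dodgers in the 24 1/2th Century (Warner Bros", "1953"],
    ["5", "One Froggy Evening", "Warner Bros", "1956"],
    ["6", "Gertie the Dinosaur", "McCay"],
    ["17", "Popeye the Sailor", "Meets", "Sinbad the Sailor", "Fletcher", "1936"],
]
num_columns = decide_num_columns(records)
print(num_columns)
```

Records with more fields than the vote are the "long" records to re-split, and records with fewer are the "short" records to align.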


Re-Splitting Long Records

Input line: 3. The Band Concert (Disney /1935)

Pre-processing: removing delimiters

Maximum Number of Output Fields = 3

Candidate subsequences ranked by FQ score:

Subsequence || FQ

The Band Concert || 0.92

The Band || 0.89

Disney || 0.82

Band Concert || 0.65

Disney 1935 || 0.51

1935 || 0.34

... || …

3 || 0.15

Band Concert Disney || 0.12

3 The Band || 0.07

Concert Disney 1935 || 0.03

Fields shown in the figure: The Band Concert; 3 The Band Concert Disney / 1935


Aligning Short Records (Null Insertion)

[Figure: the independently split records, each with its Avg. FQ Score, go through (1) Sorting and (2) Iterative Alignment; in the output table, short records are padded with NULL.]

To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm.

[NW, J. of Molecular Biology, 1970]

The two sequences:

Sequence #1: Table columns

Sequence #2: Fields of a short record

Design a Field-to-Field Consistency (F2FC) Score.

Use the average F2FC Score as the similarity measure for the alignment algorithm.
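The alignment step can be sketched as a small dynamic program. This is a simplified Needleman-Wunsch-style variant, assuming gaps (NULLs) are only ever inserted into the short record and every field keeps its left-to-right order. The `sim` callable stands in for the average F2FC score between a field and a column; `toy_sim`, the gap penalty, and all names are illustrative assumptions.

```python
from typing import Callable, List, Optional

def align_short_record(fields: List[str], num_columns: int,
                       sim: Callable[[int, str], float],
                       gap_penalty: float = 0.0) -> List[Optional[str]]:
    """Place fields into columns in order, filling unused columns with None (NULL)."""
    m, n = len(fields), num_columns
    if m > n:
        raise ValueError("record has more fields than columns")
    neg_inf = float("-inf")
    # best[i][j]: best score for placing the first i fields into the first j columns.
    best = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        best[0][j] = -gap_penalty * j
    for i in range(1, m + 1):
        for j in range(i, n + 1):
            match = best[i - 1][j - 1] + sim(j - 1, fields[i - 1])
            skip = best[i][j - 1] - gap_penalty   # column j-1 receives a NULL
            best[i][j] = max(match, skip)
    # Trace back from the bottom-right cell to recover the placement.
    row: List[Optional[str]] = [None] * n
    i, j = m, n
    while j > 0:
        if i > 0 and best[i][j] == best[i - 1][j - 1] + sim(j - 1, fields[i - 1]):
            row[j - 1] = fields[i - 1]
            i -= 1
        j -= 1
    return row

def toy_sim(col: int, field: str) -> float:
    """Toy column model: col 0 = rank, cols 1-2 = text, col 3 = four-digit year."""
    is_year = field.isdigit() and len(field) == 4
    is_rank = field.isdigit() and len(field) <= 2
    if col == 0:
        return 1.0 if is_rank else 0.0
    if col == 3:
        return 1.0 if is_year else 0.0
    return 0.0 if field.isdigit() else 0.5

aligned = align_short_record(["6", "Gertie the Dinosaur", "McCay"], 4, toy_sim)
print(aligned)
```

Under this toy similarity, the three-field record keeps its first three columns and receives a NULL in the year column.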

Field-to-Field Consistency (F2FC) Score

Linear combination of multiple score components. Each component corresponds to a source of evidence.

1. Data Type

Check if data types are consistent

2. Table Corpus

Check if two fields co-occur in the same column in a table in the corpus

3. Syntax

Measure the consistency of the syntax of the two fields (e.g., length, % of upper/lower case letters, digits, spaces, etc.)

4. Delimiters

Measure the consistency between the delimiters on both sides of the two fields
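The sketch below illustrates two of these components, data type and syntax, combined linearly; the table-corpus and delimiter components are omitted for brevity. The type detector, the syntax features, the way they are compared, and the equal weights are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

def data_type(field: str) -> str:
    """Very coarse type detector (assumed): year, number, or free text."""
    if re.fullmatch(r"\d{4}", field):
        return "year"
    if re.fullmatch(r"\d+", field):
        return "number"
    return "text"

def syntax_features(field: str) -> List[float]:
    """Fractions of upper-case, lower-case, digit, and space characters."""
    n = max(len(field), 1)
    return [
        sum(c.isupper() for c in field) / n,
        sum(c.islower() for c in field) / n,
        sum(c.isdigit() for c in field) / n,
        sum(c.isspace() for c in field) / n,
    ]

def syntax_consistency(a: str, b: str) -> float:
    """Average of character-class similarity and length ratio, both in [0, 1]."""
    fa, fb = syntax_features(a), syntax_features(b)
    char_sim = 1.0 - sum(abs(x - y) for x, y in zip(fa, fb)) / 2.0
    len_sim = min(len(a), len(b)) / max(len(a), len(b), 1)
    return (char_sim + len_sim) / 2.0

def f2fc(a: str, b: str, weights: Tuple[float, float] = (0.5, 0.5)) -> float:
    """Linear combination of data-type and syntax consistency (weights assumed)."""
    w_type, w_syntax = weights
    type_match = 1.0 if data_type(a) == data_type(b) else 0.0
    return w_type * type_match + w_syntax * syntax_consistency(a, b)

same_column = f2fc("Disney", "Warner Bros")
different_column = f2fc("1957", "Warner Bros")
```

Two studio names score higher against each other than a year does against a studio name, which is the behaviour the alignment step relies on.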


Refinement Phase

1. Detect inconsistent fields in the output table

2. Consider streaks only

3. Re-merge

4. Re-split (and re-align if needed), using the extended FQ score

[Figure: output table grids; inconsistent fields are marked X, isolated X marks are dropped so that only streaks within a row remain, and re-split streaks are marked √.]
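The streak-finding and re-merge steps can be sketched as follows, assuming a streak is a run of at least two adjacent inconsistent fields in the same row (an assumption read off the figure, where isolated marks are dropped). The function names and the boolean-flag representation are hypothetical.

```python
from typing import List, Optional, Tuple

def find_streaks(flags: List[bool], min_length: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of runs of flagged fields."""
    streaks = []
    start = None
    # A trailing False closes any run that reaches the last column.
    for col, flagged in enumerate(flags + [False]):
        if flagged and start is None:
            start = col
        elif not flagged and start is not None:
            if col - start >= min_length:
                streaks.append((start, col))
            start = None
    return streaks

def remerge(row: List[Optional[str]], streak: Tuple[int, int]) -> str:
    """Join the non-NULL fields of a streak back into one string for re-splitting."""
    start, end = streak
    return " ".join(f for f in row[start:end] if f is not None)

# Rows from the figure: X marks an inconsistent field.
row_a = [False, False, True, True, True, False]   # … … X X X …
row_b = [False, True, False, False, False, False]  # … X … … … …
streaks_a = find_streaks(row_a)
streaks_b = find_streaks(row_b)
```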

Field Quality (FQ) Score [Revisited]

Linear combination of multiple score components. Each component corresponds to a source of evidence.

1. Data Type

2. Table Corpus

3. Language Model

4. List Support: favors candidates which are more consistent with the columns spanned by the streak


Table Extraction (TE) Score

Average FQ score over all fields in the extracted table

Used to compare and rank the extracted tables by their extraction quality
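As a small illustration, the average can be computed as below. Skipping NULL cells and returning 0.0 for an empty table are assumptions; the slides do not say how those cases are handled, and the names are hypothetical.

```python
from typing import Callable, List, Optional

def te_score(table: List[List[Optional[str]]], fq: Callable[[str], float]) -> float:
    """Average FQ score over all non-NULL fields of an extracted table."""
    scores = [fq(field) for row in table for field in row if field is not None]
    return sum(scores) / len(scores) if scores else 0.0

toy_fq_scores = {"a": 1.0, "b": 0.5, "c": 0.0}   # toy FQ values for illustration
toy_table = [["a", "b"], ["c", None]]
score = te_score(toy_table, lambda f: toy_fq_scores[f])
print(score)
```

Sorting tables by this score gives the "top percentage of extracted tables" ordering used in the experiments.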

Outline

The ListExtract Approach

Experiments

Conclusion

Overall Performance for WLists and TDLists

[Figure: x-axis: top percentage of extracted tables (20 to 100); series: WLists, TDLists]

WLists: A set of 20 manually-collected HTML lists spanning 20 different domains.

TDLists: A set of 100 lists derived from randomly-selected HTML tables

Effect of the Refinement Phase (WLists)

[Figure: x-axis: top percentage of extracted tables (20 to 100); series: Refinement, No Refinement]

Large-Scale Experiment

A crawl of 100K web pages

100K extracted lists

32K lists after filtering

11K extracted tables with multiple columns

[Figure: x-axis: Table Extraction Score (0 to 1); annotated points: (0.65, ~1,000 tables) and (0.45, ~10,300 tables)]

Outline

The ListExtract Approach

Experiments

Conclusion

Conclusion

Our work is a continuation of the efforts to extract structured data from the Web.

Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions.

Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.

Thank you

Questions?
