Harvesting Relational Tables from Lists on the Web
Hazem Elmeleegy, Purdue University
Jayant Madhavan and Alon Halevy, Google Inc.
Outline
Introduction
The ListExtract Approach
Experiments
Conclusion
Lists on the Web
• Our Goal: Extract tabular data from all such lists in an unsupervised and domain-independent manner.
• Not the typical wrapper generation problem.
Cartoons Example
• A period (“.”) is used both as a delimiter and to terminate abbreviations.
• A slash (“/”) is used both as a delimiter and as part of the text.
• The slash (“/”) delimiter is missing (along with the production year).
• Easy for Humans
• Confusing for Machines
Key Contributions
Developed the ListExtract system, which extracts tables from lists in an unsupervised and domain-independent manner.
Introduced the use of external sources of information, such as a large collection of tables collected from the Web and a language model, to help in the splitting decisions.
Conducted a large-scale experimental study which suggests that tens of millions of high-quality lists can be exploited on the Web.
ListExtract Approach
Independent Splitting Phase
• Splitting Lines into Records
Alignment Phase
• Deciding the Number of Columns
• Re-Splitting Long Records
• Aligning Short Records (Null Insertion)
Refinement Phase
• Detecting Inconsistent Fields
• Re-Splitting Detected Field Streaks
• Re-Aligning Detected Field Streaks (Null Insertion)
Intermediate Outputs (Independent Splitting Phase)
1 || What’s Opera Doc || Warner Bros || 1957
2 || Duck Amuck || Warner Bros || 1953
3 || The Band Concert || Disney || 1935
4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953
5 || One Froggy Evening || Warner Bros || 1956
6 || Gertie the Dinosaur || McCay
…
17 || Popeye the Sailor || Meets || Sinbad the Sailor || Fletcher || 1936
Intermediate Outputs (Re-Splitting Long Records)
1 || What’s Opera Doc || Warner Bros || 1957
2 || Duck Amuck || Warner Bros || 1953
3 || The Band Concert || Disney || 1935
4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953
5 || One Froggy Evening || Warner Bros || 1956
6 || Gertie the Dinosaur || McCay
…
17. Popeye the Sailor Meets || Sinbad the Sailor || Fletcher || 1936
Number of Columns = 4
Intermediate Outputs (Alignment Phase)
1 What’s Opera Doc Warner Bros 1957
2 Duck Amuck Warner Bros 1953
3 The Band Concert Disney 1935
4. Duck Dodgers in the 24 1/2th Century (Warner Bros 1953
5 One Froggy Evening Warner Bros 1956
6 Gertie the Dinosaur McCay
… … … …
17. Popeye the Sailor Meets Sinbad the Sailor Fletcher 1936
Final Output (Refinement Phase)
1 || What’s Opera Doc || Warner Bros || 1957
2 || Duck Amuck || Warner Bros || 1953
3 || The Band Concert || Disney || 1935
4 || Duck Dodgers in the 24 1/2th Century || Warner Bros || 1953
5 || One Froggy Evening || Warner Bros || 1956
6 || Gertie the Dinosaur || McCay || NULL
… || … || … || …
17 || Popeye the Sailor Meets Sinbad the Sailor || (Fletcher) || 1936
Line Splitting Algorithm
Input line: 3. The Band Concert (Disney /1935)
Pre-processing (removing delimiters): 3 The Band Concert Disney 1935
Candidate subsequences ranked by FQ score (√ = selected as a field):
Subsequence                       FQ Score
The Band Concert                  0.92  √
The Band                          0.89
Disney                            0.82  √
Band Concert                      0.65
Disney 1935                       0.51
1935                              0.34  √
...                               …
3                                 0.15  √
Band Concert Disney               0.12
3 The Band                        0.07
Concert Disney 1935               0.03
3 The Band Concert Disney 1935    0.01
Result: 3 || The Band Concert || Disney || 1935
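The selection shown on this slide can be read as a greedy procedure: take the highest-scoring subsequence, skip anything that overlaps an earlier pick, and continue until every token belongs to a field. The sketch below is one possible reading, not the authors' implementation; `greedy_split` and the `fq` lookup table are hypothetical names, and subsequences missing from the table are assumed to score 0.

```python
from typing import Dict, List, Tuple


def greedy_split(tokens: List[str], fq: Dict[str, float]) -> List[str]:
    """Pick non-overlapping token subsequences in decreasing FQ-score order.

    `fq` maps a space-joined subsequence to its score (hypothetical lookup;
    unknown subsequences are assumed to score 0.0).
    """
    n = len(tokens)
    # Enumerate every contiguous subsequence as (score, start, end).
    candidates: List[Tuple[float, int, int]] = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            text = " ".join(tokens[i:j])
            candidates.append((fq.get(text, 0.0), i, j))
    candidates.sort(key=lambda c: -c[0])  # highest score first

    covered = [False] * n
    chosen: List[Tuple[int, int]] = []
    for _score, i, j in candidates:
        if not any(covered[i:j]):  # skip candidates overlapping earlier picks
            chosen.append((i, j))
            for k in range(i, j):
                covered[k] = True
        if all(covered):
            break

    chosen.sort()  # restore left-to-right field order
    return [" ".join(tokens[i:j]) for i, j in chosen]
```

With the scores listed on the slide, the picks are made in the order The Band Concert, Disney, 1935, 3, and the fields are then returned in their original left-to-right order.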
Field Quality (FQ) Score
Linear combination of multiple score components
Each component corresponds to a source of evidence
Score Components
1. Data Type
   Regular expressions to capture different data types (e.g. dates, emails, currencies, etc.)
   Score: 1 if match found, 0 otherwise
2. Table Corpus
   Check if candidate sequence existed as a field in the table corpus
   Score: 1 if exists, 0 otherwise
3. Language Model
   Measure the likelihood that candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text.
   Score: a combination of the probabilities capturing both the likelihood and unlikelihood
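As a rough illustration of the linear combination described above, the following sketch combines the three components. Everything concrete in it is an assumption for illustration: the regular expressions, the weights `(0.3, 0.4, 0.3)`, the lower-cased corpus lookup, and the `lm_score` callable (a stand-in for the language-model component, expected to return a value in [0, 1]) do not come from the slides.

```python
import re
from typing import Callable, Set, Tuple

# Hypothetical data-type patterns; the slides only say "regular expressions".
DATA_TYPE_PATTERNS = [
    re.compile(r"^\d{4}$"),                       # a four-digit year
    re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),  # an email address
    re.compile(r"^[$€£]\s?\d+(\.\d{2})?$"),       # a simple currency amount
]


def data_type_score(candidate: str) -> float:
    """1 if any data-type pattern matches the whole candidate, else 0."""
    return 1.0 if any(p.match(candidate) for p in DATA_TYPE_PATTERNS) else 0.0


def table_corpus_score(candidate: str, corpus_fields: Set[str]) -> float:
    """1 if the candidate appears as a field in the (lower-cased) corpus, else 0."""
    return 1.0 if candidate.lower() in corpus_fields else 0.0


def fq_score(
    candidate: str,
    corpus_fields: Set[str],
    lm_score: Callable[[str], float],
    weights: Tuple[float, float, float] = (0.3, 0.4, 0.3),  # assumed weights
) -> float:
    """Weighted sum of the data-type, table-corpus and language-model scores."""
    w_type, w_corpus, w_lm = weights
    return (
        w_type * data_type_score(candidate)
        + w_corpus * table_corpus_score(candidate, corpus_fields)
        + w_lm * lm_score(candidate)
    )
```

Passing the language model in as a callable keeps the sketch independent of any particular model; a real system would plug in its own scorer and tune the weights.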
Deciding the Number of Columns
Majority voting across all records
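A minimal sketch of the majority vote, assuming each record is a list of fields produced by the independent splitting phase. The function name `decide_num_columns` and the tie-breaking rule (prefer the larger field count) are assumptions, not stated in the slides.

```python
from collections import Counter
from typing import List


def decide_num_columns(records: List[List[str]]) -> int:
    """Return the field count that occurs in the most records."""
    counts = Counter(len(record) for record in records)
    # Ties are broken toward the larger field count (an assumption).
    return max(counts, key=lambda k: (counts[k], k))
```

For the cartoons example, the independently split records shown earlier have 4, 4, 4, 2, 4, 3 and 6 fields, so the vote yields 4 columns.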
Re-Splitting Long Records
Input line: 3. The Band Concert (Disney /1935)
Pre-processing (removing delimiters): 3 The Band Concert Disney 1935
Maximum Number of Output Fields = 3
Candidate subsequences ranked by FQ score:
Subsequence                       FQ Score
The Band Concert                  0.92
The Band                          0.89
Disney                            0.82
Band Concert                      0.65
Disney 1935                       0.51
1935                              0.34
...                               …
3                                 0.15
Band Concert Disney               0.12
3 The Band                        0.07
Concert Disney 1935               0.03
3 The Band Concert Disney 1935    0.01
Result: 3 || The Band Concert || Disney / 1935
Aligning Short Records (Null Insertion)
[Figure: seven independently split records with average FQ scores 0.88, 0.79, 0.49, 0.62, 0.73, 0.92 and 0.86. Step 1 (Sorting) reorders them to 0.92, 0.86, 0.79, 0.62, 0.88, 0.73, 0.49. Step 2 (Iterative Alignment) adds them to the output table one at a time, padding short records with NULL.]
Aligning Short Records (Null Insertion)
To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm [NW, J. of Molecular Biology, 1970].
The two sequences:
Sequence #1: Table columns
Sequence #2: Fields of a short record
Design a Field-to-Field Consistency (F2FC) Score.
Use the average F2FC Score as the similarity measure for the alignment algorithm.
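The alignment step can be sketched as a restricted form of Needleman-Wunsch global alignment in which gaps are only inserted into the short record, so that every field lands in exactly one column and the remaining columns receive NULL. This is a simplified sketch under stated assumptions: `align_short_record`, the `sim(column_index, field)` callable (standing in for the average F2FC score of a field against a column), and the zero gap penalty are all hypothetical choices.

```python
from typing import Callable, List, Optional


def align_short_record(
    fields: List[str],
    num_cols: int,
    sim: Callable[[int, str], float],
    gap: float = 0.0,  # assumed penalty for leaving a column empty
) -> List[Optional[str]]:
    """Place `fields` into `num_cols` columns in order; None marks a NULL."""
    m = len(fields)
    if m > num_cols:
        raise ValueError("record is longer than the table; re-split it first")

    neg_inf = float("-inf")
    # best[i][j]: best score placing the first i fields in the first j columns.
    best = [[neg_inf] * (num_cols + 1) for _ in range(m + 1)]
    for j in range(num_cols + 1):
        best[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(i, num_cols + 1):
            match = best[i - 1][j - 1] + sim(j - 1, fields[i - 1])
            skip = best[i][j - 1] + gap  # leave column j-1 empty (NULL)
            best[i][j] = max(match, skip)

    # Trace back from the last cell to recover the placement.
    row: List[Optional[str]] = [None] * num_cols
    i, j = m, num_cols
    while j > 0:
        if i > 0 and best[i][j] == best[i - 1][j - 1] + sim(j - 1, fields[i - 1]):
            row[j - 1] = fields[i - 1]
            i -= 1
        j -= 1
    return row
```

Restricting gaps to one side keeps the table width fixed, which matches the idea of padding short records with NULLs rather than widening the table.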
Field-to-Field Consistency (F2FC) Score
Linear combination of multiple score components
Each component corresponds to a source of evidence
Score Components
1. Data Type
   Check if data types are consistent
2. Table Corpus
   Check if two fields co-occur in the same column in a table in the corpus
3. Syntax
   Measure the consistency of the syntax of the two fields (e.g. length, % of upper/lower case letters, digits, spaces, etc.)
4. Delimiters
   Measure the consistency between the delimiters on both sides of the two fields
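To make the syntax component more concrete, here is one way such a consistency measure could be computed from the features named above (length and the share of upper-case letters, lower-case letters, digits and spaces). The exact formula, the equal weighting of the two parts, and the function names are assumptions for illustration only.

```python
from typing import Tuple


def syntax_features(field: str) -> Tuple[float, float, float, float]:
    """Fractions of upper-case, lower-case, digit and space characters."""
    n = max(len(field), 1)  # avoid division by zero for empty fields
    return (
        sum(c.isupper() for c in field) / n,
        sum(c.islower() for c in field) / n,
        sum(c.isdigit() for c in field) / n,
        sum(c.isspace() for c in field) / n,
    )


def syntax_consistency(a: str, b: str) -> float:
    """Score in [0, 1]; higher means the two fields look more alike."""
    fa, fb = syntax_features(a), syntax_features(b)
    # The L1 distance between the two fraction vectors is at most 2.
    char_sim = 1.0 - sum(abs(x - y) for x, y in zip(fa, fb)) / 2.0
    len_sim = min(len(a), len(b)) / max(len(a), len(b), 1)
    return 0.5 * char_sim + 0.5 * len_sim  # assumed equal weighting
```

Two four-digit years score 1.0 under this measure, while a year compared with a studio name scores much lower.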
Refinement Phase
[Figure: a six-column output table in which inconsistent fields are marked X; isolated marks are discarded and only runs of adjacent marked fields within a record (streaks) are processed.]
1. Detect inconsistent fields in the output table
2. Consider streaks only
3. Re-merge the fields of each streak
4. Re-split (and re-align if needed), using the extended FQ score
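The "consider streaks only" step can be sketched as a scan over a grid of inconsistency flags, keeping only runs of adjacent flagged fields within a record. The minimum run length of 2 is inferred from the figure (isolated marks are dropped) and is an assumption, as is the name `find_streaks`; how fields get flagged in the first place is outside this sketch.

```python
from typing import List, Tuple


def find_streaks(
    flags: List[List[bool]], min_len: int = 2
) -> List[Tuple[int, int, int]]:
    """Return (row, start_col, end_col_exclusive) for each run of flagged fields."""
    streaks: List[Tuple[int, int, int]] = []
    for r, row in enumerate(flags):
        start = None
        # A trailing False sentinel closes a run that reaches the last column.
        for c, flagged in enumerate(list(row) + [False]):
            if flagged and start is None:
                start = c
            elif not flagged and start is not None:
                if c - start >= min_len:
                    streaks.append((r, start, c))
                start = None
    return streaks
```

Each returned span identifies the fields that would be re-merged and then re-split in the following steps.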
Field Quality (FQ) Score [Revisited]
Linear combination of multiple score components
Each component corresponds to a source of evidence
Score Components
1. Data Type
2. Table Corpus
3. Language Model
4. List Support
   • favors candidates which are more consistent with the columns spanned by the streak
Table Extraction (TE) Score
Average FQ Score for all fields in the extracted table
Used to compare and rank the extracted tables based on their extraction quality
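A small sketch of the TE score as an average of per-field FQ scores, plus a ranking helper. The slides do not say how NULL fields are treated; skipping them (represented here as `None`) is an assumption, and `te_score` and `rank_tables` are hypothetical names.

```python
from statistics import mean
from typing import Dict, List, Optional


def te_score(table_fq: List[List[Optional[float]]]) -> float:
    """Average FQ score over all fields; None (NULL) fields are skipped (assumption)."""
    return mean(s for row in table_fq for s in row if s is not None)


def rank_tables(tables: Dict[str, List[List[Optional[float]]]]) -> List[str]:
    """Return table names ordered from highest to lowest TE score."""
    return sorted(tables, key=lambda name: te_score(tables[name]), reverse=True)
```

Ranking by this score is what allows a top percentage of extracted tables to be selected.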
Overall Performance for WLists and TDLists
[Figure: F-measure (0.5 to 1) versus top percentage of extracted tables (20 to 100), with one curve for WLists and one for TDLists]
WLists: A set of 20 manually-collected HTML lists spanning 20 different domains.
TDLists: A set of 100 lists derived from randomly-selected HTML tables
Effect of the Refinement Phase (WLists)
[Figure: F-measure (0.5 to 1) versus top percentage of extracted tables (20 to 100), with one curve for Refinement and one for No Refinement]
Large-Scale Experiment
A crawl of 100K web pages
100K extracted lists
32K lists after filtering
11K extracted tables with multiple columns
[Figure: number of extracted tables (0 to 11,000) versus Table Extraction Score (0 to 1); annotated points: (0.65, ~1,000 tables) and (0.45, ~10,300 tables)]
Conclusion
Our work is a continuation of the efforts to extract structured data from the Web.
Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions.
Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.
Thank you
Questions?