Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group
Department of Computer Science
Brigham Young University
supported by NSF
Introduction
Many tables on the Web
How to integrate data stored in different tables?
Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
Problem: Detecting the Table of Interest
Problem
Different source table schemas
  {Run #, Yr, Make, Model, Tran, Color, Dr}
  {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  {Vehicle, Distance, Price, Mileage}
  {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema
  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Different schemas
Problem: Attribute is Value
Problem: Attribute-Value is Value
Problem: Value is not Value
Problem: Factored Values
Problem: Split Values
Problem: Merged Values
Problem: Information Behind Links
Single-column table (formatted as list)
Table extending over several pages
Solution
Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
Solution: Detect the Table of Interest
‘Real’ table test
  Same number of values
  Table size
Attribute test
Density measure test
  density = (# of ontology-extracted values) / (total # of values in the table)
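The density measure above can be sketched in a few lines. This is only an illustration: the regular expressions in `RECOGNIZERS` are hypothetical stand-ins for the extraction ontology, and the `threshold` value is an assumption, not a figure from the slides.

```python
import re

# Hypothetical recognizers standing in for the extraction ontology.
RECOGNIZERS = {
    "Year": re.compile(r"(19|20)\d{2}"),
    "Price": re.compile(r"\$\d[\d,]*"),
    "Make": re.compile(r"Honda|Ford|Toyota", re.IGNORECASE),
}


def density(rows):
    """Fraction of non-empty cells fully matched by at least one recognizer."""
    cells = [cell.strip() for row in rows for cell in row if cell.strip()]
    if not cells:
        return 0.0
    recognized = sum(
        1 for cell in cells
        if any(rx.fullmatch(cell) for rx in RECOGNIZERS.values())
    )
    return recognized / len(cells)


def is_table_of_interest(rows, threshold=0.5):
    """Assumed decision rule: accept the table if enough values are recognized."""
    return density(rows) >= threshold


car_table = [["Honda", "1995", "$6300"], ["Ford", "1998", "$7,988"]]
nav_table = [["Home", "About"], ["Contact", "Help"]]
```

A table of car advertisements scores high because most cells match a recognizer, while a navigation table scores near zero.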
Solution: Remove Factoring
[Slide figure: Year column with the factored value repeated in every row: 2001 (3 rows), 2000 (6 rows), 1999 (2 rows)]
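One way to remove factoring is to copy a factored value down into the rows that omit it. The sketch below assumes the factored column is left empty in the rows it applies to; the function name `remove_factoring` and the sample rows are illustrative, not taken from the slides.

```python
def remove_factoring(rows, column):
    """Copy the last non-empty value of `column` into rows that omit it."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # work on a copy so the input stays untouched
        if row[column].strip():
            last = row[column]
        elif last is not None:
            row[column] = last
        filled.append(row)
    return filled


factored = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Toyota"],
    ["", "Mazda"],
]
unfactored = remove_factoring(factored, column=0)
```

After this step every row carries its own year, so each row can be turned into attribute-value pairs independently.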
Solution: Replace Boolean Values
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
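The Boolean replacement and pair formation can be sketched together. The row values below ("Yes"/"No" cells) and the sets `TRUE_MARKS` and `FALSE_MARKS` are assumptions made for illustration; the slides show only the resulting pairs.

```python
# Assumed cell markers for Boolean columns; real tables may use others.
TRUE_MARKS = {"yes", "y", "x", "true"}
FALSE_MARKS = {"no", "n", "", "false"}


def form_pairs(header, row):
    """Build attribute-value pairs, replacing Boolean cells by the attribute name."""
    pairs = []
    for attribute, value in zip(header, row):
        mark = value.strip().lower()
        if mark in TRUE_MARKS:
            # A true Boolean cell means the attribute itself is the value.
            pairs.append((attribute, attribute))
        elif mark in FALSE_MARKS:
            # A false or empty cell contributes no pair.
            continue
        else:
            pairs.append((attribute, value.strip()))
    return pairs


header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300",
       "Yes", "Yes", "Yes", "No"]
pairs = form_pairs(header, row)
```

With these assumed inputs, the Boolean columns yield pairs such as `("Auto", "Auto")`, and the `CD` column, marked "No", is dropped.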
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
Solution: Add Information Hidden Behind Links
Unstructured and semi-structured: concatenate
<Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
Single attribute-value pairs: pair them together
List: mark the beginning and the end (<, >)
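The three cases for linked pages can be sketched as one merge step. Everything here is an assumption made for illustration: the function `merge_linked_page`, its parameters, and the textual record format are hypothetical, and the slides do not say how the merged record is represented.

```python
def merge_linked_page(row_pairs, linked_pairs=None, linked_list=None,
                      linked_text=None):
    """Return one text record for a table row plus the page behind its link."""
    parts = [f"<{a}, {v}>" for a, v in row_pairs]
    if linked_pairs:
        # Single attribute-value pairs: pair them together.
        parts += [f"<{a}, {v}>" for a, v in linked_pairs]
    if linked_list:
        # List: mark the beginning and the end.
        parts.append("<" + ", ".join(linked_list) + ">")
    if linked_text:
        # Unstructured or semi-structured text: concatenate as-is.
        parts.append(linked_text.strip())
    return " ".join(parts)


record = merge_linked_page(
    [("Price", "$7,988")],
    linked_pairs=[("Doors", "4"), ("Fuel Type", "Gas")],
)
```

The merged record can then go through the same extraction step as a row that had no link.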
Solution: Inferred Mapping Creation
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Each row is a car.
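Inferring mappings from extraction patterns can be sketched as a vote over where extracted values came from. This is an assumed simplification, not the method from the slides: `infer_mappings`, the `min_support` threshold, and the sample extraction log are all hypothetical.

```python
from collections import Counter, defaultdict


def infer_mappings(extractions, min_support=0.5):
    """Map each source attribute to the target attribute it most often fills.

    `extractions` holds one (source_attribute, target_attribute) entry
    per extracted value.
    """
    votes = defaultdict(Counter)
    for source, target in extractions:
        votes[source][target] += 1

    mappings = {}
    for source, counter in votes.items():
        target, count = counter.most_common(1)[0]
        # Keep the mapping only if the winning target has enough support.
        if count / sum(counter.values()) >= min_support:
            mappings[source] = target
    return mappings


# Illustrative extraction log; not data from the experiments.
log = [
    ("Yr", "Year"), ("Yr", "Year"),
    ("Colour", "Feature"), ("Auto", "Feature"),
    ("Price", "Price"), ("Price", "Mileage"), ("Price", "Price"),
]
mappings = infer_mappings(log)
```

Here two source attributes map to `Feature`, which matches the separate {Car, Feature} relation in the target schema.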
Experimental Results
Car Advertisement application domain
10 “training” tables
  100% of the 57 mappings (no false mappings)
  94.6% precision of the values in linked pages (5.4% false declarations)
50 test tables
  94.7% of the 300 mappings (no false mappings)
  On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
Other Applications
Cell Phone Plan application domain
Soccer Player application domain
Contribution
Provides an approach to extract information automatically from HTML tables
Suggests a different way to solve the problem of schema matching