Data Extraction and Integration from Imprecise Web Sources
Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
(Creative Commons License, see last slide)
Data-intensive websites
Website
Database
Template1
Template2
Template3
target
Flint goal
…StockQuote
Last Min Max
Volume 52high Open
Flint
System architecture
WebSearch [WIDM08]
Data Extraction
Data Integration
The Web
Novel contribution
• Unsupervised
• Automatic
• Scalable
• No knowledge available
Data Extraction
RoadRunner [Vldb01] ExAlg [Sigmod03]
TurboWrapper [Vldb07]
• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available
Data Integration
WebTables [Vldb08] Cimple [Vldb07]
MetaQuerier [Cidr05] PayGo [Cidr07]
Data Extraction
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio,
Add INTC to Your Portfolio, …
…
Data Extraction
HTML fragments taken from two pages belonging to the same website:
/html/body/table/tr[1]/td[2]: 1,132,228 , 1,735,857
/html/body/table/tr[2]/td[2]: $20.66 , $414.58
/html/body/table/tr[3]/td[2]: $11.70 , $247.30
/html/body/table/tr[4]/td[2]: $20.72 , $414.06
Extraction error!
/html/body/table/tr[5]/td[2]: $0.02 , 99,494,200
?
/html/body/table/tr[6]/td[2]: 4,732,600 , null
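The extraction error above is typical of purely positional rules. A minimal sketch, assuming the second page simply lacks one table row (the slides do not state the cause); the row labels `f1` to `f6` and the helper names are hypothetical, and only the cell values come from the slide:

```python
import xml.etree.ElementTree as ET

def make_page(rows):
    """Build a tiny well-formed page: one table row per (label, value) pair."""
    cells = "".join(f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows)
    return ET.fromstring(f"<html><body><table>{cells}</table></body></html>")

# Row labels are hypothetical; the values are the ones shown on the slide.
page_a = make_page([("f1", "1,132,228"), ("f2", "$20.66"), ("f3", "$11.70"),
                    ("f4", "$20.72"), ("f5", "$0.02"), ("f6", "4,732,600")])
# Assumption: the second page has no fifth row, so the last row shifts up by one.
page_b = make_page([("f1", "1,735,857"), ("f2", "$414.58"), ("f3", "$247.30"),
                    ("f4", "$414.06"), ("f6", "99,494,200")])

def extract(page, row):
    """Apply the positional rule /html/body/table/tr[row]/td[2] to one page."""
    cell = page.find(f"./body/table/tr[{row}]/td[2]")
    return cell.text if cell is not None else None

# Rows 1-4 align across pages; row 5 mixes two attributes; row 6 yields None.
for row in range(1, 7):
    print(row, extract(page_a, row), extract(page_b, row))
```

Under this assumption the rule for `tr[5]` pairs a small dollar amount with a large count, and the rule for `tr[6]` returns nothing on the second page, which mirrors the two problem rows on the slide.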
Data Integration
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5 t=0.5
Data Integration
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5 t=0.5
1.0 1.0 1.0
Data Integration
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5 t=0.5
(price) 6, 26, 12 | (min) 4, 25, 10 | (stock) AA, GO, MS
0.6 1.0 1.0
Data Integration
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5 t=0.5
(price) 6, 26, 12 | (min) 4, 25, 10 | (stock) AA, GO, MS
?
1.0 1.0
Data Integration
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5
(price) 6, 26, 12 | (min) 4, 25, 10 | (stock) AA, GO, MS
1.0
t=0.7 t=0.7
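The effect of moving a threshold from 0.5 to 0.7 can be sketched with the scores shown on the slides. This is only an illustration: the function name `accepted_matches` is hypothetical, the pairing of each score with an attribute follows the column order on the slide, and a single threshold is applied to every candidate for simplicity, whereas the slides show per-mapping thresholds.

```python
def accepted_matches(scores, threshold):
    """Keep only the candidate matches whose score reaches the threshold."""
    return {pair: score for pair, score in scores.items() if score >= threshold}

# Scores from the slides, keyed as (attribute in the new source, attribute in the mapping).
scores = {
    ("price", "max"): 0.6,
    ("min", "min"): 1.0,
    ("stock", "stock"): 1.0,
}

loose = accepted_matches(scores, 0.5)   # price is still matched against max
strict = accepted_matches(scores, 0.7)  # the doubtful price/max match is rejected

print(sorted(loose))
print(sorted(strict))
```

With t=0.5 the price column would be merged with max; with t=0.7 only the two matches scored 1.0 survive.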
Data Integration
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5
(price) 6, 26, 12 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.7 t=0.7
Wrapper Refinement
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5
(price) 6, 26, 12 | (min) 4, 25, 10 | (stock) AA, GO, MS
(min/max) 10, null, 10
? ?
0.3 (weak) 0.3 (weak) 0.0 0.0
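One way to see why a column that mixes two attributes matches both only weakly is a simple agreement count. This is a hypothetical measure, not the score Flint computes (exact agreement would give 0.0 for price against max, where the slides report 0.6). It also assumes the fused slide values read as max = 10, 33, 16, min = 4, 25, 10 and (min/max) = 10, null, 10.

```python
def agreement(column, reference):
    """Fraction of aligned positions where both columns hold the same value."""
    hits = sum(1 for a, b in zip(column, reference) if a is not None and a == b)
    return hits / len(reference)

# Assumed reading of the slide values, one entry per stock (AA, GO, MS).
max_col = [10, 33, 16]
min_col = [4, 25, 10]
mixed = [10, None, 10]  # the badly extracted (min/max) column

print(round(agreement(mixed, max_col), 1))  # 0.3: only the first value agrees
print(round(agreement(mixed, min_col), 1))  # 0.3: only the last value agrees
print(agreement(max_col, max_col))          # 1.0 for identical columns
```

Under these assumptions the mixed column agrees with each clean column on one value out of three, which is the kind of weak signal that can trigger a refinement of the wrapper rather than a match.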
Wrapper Refinement
matching value
nearby template tokens
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
t=0.7 t=0.7
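Rules anchored on a nearby template token keep working when rows move. A minimal sketch, assuming a two-column table layout; the page contents, the extra `Volume` row, and the helper names are hypothetical. Because the standard-library ElementTree does not support the XPath `contains()` function, the rule `//td[contains(text(),token)]/../td[2]` is approximated with plain iteration.

```python
import xml.etree.ElementTree as ET

def make_page(rows):
    """Build a tiny well-formed page: one table row per (label, value) pair."""
    cells = "".join(f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows)
    return ET.fromstring(f"<html><body><table>{cells}</table></body></html>")

def extract_by_label(page, token):
    """Return the second cell of the first row that has a cell containing the token."""
    for row in page.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and any(token in (cell.text or "") for cell in cells):
            return cells[1].text
    return None

# Hypothetical pages: the second one has an extra leading row, so positions differ.
page_a = make_page([("Max", "10"), ("Min", "4")])
page_b = make_page([("Volume", "n/a"), ("Max", "33"), ("Min", "25")])

for page in (page_a, page_b):
    print(extract_by_label(page, "Max"), extract_by_label(page, "Min"))
```

A positional rule such as `tr[1]/td[2]` would return `10` on the first page and `n/a` on the second, while the label-anchored rule returns the Max value on both.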
Wrapper Refinement
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5
(price) 6, 26, 12 | (min) 4, 25, 10 | (stock) AA, GO, MS
(min/max) 10, null, 10
1.0 1.0
(max) 10, 33, 16 | (min) 4, 25, 10
//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]
t=0.7 t=0.7
Wrapper Refinement
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
(max) 10, 33, 16 | (min) 4, 25, 10 | (stock) AA, GO, MS
t=0.5 t=0.5
(price) 6, 26, 12 | (min) 4, 25, 10 | (stock) AA, GO, MS
(min/max) 10, null, 10
(max) 10, 33, 16 | (min) 4, 25, 10
Experimental Results (100 websites for each domain)
Soccer domain (45,714 pages)
Attribute |m|
• Name 90
• Birth Date 61
• Height 54
• Nationality 48
• Club 43
• Position 43
• Weight 34
• League 14
Videogame domain (49,262 pages)
Attribute |m|
• Title 86
• Publisher 59
• Developer 45
• Genre 28
• ESRB rating 40
• Release Date 9
• Platform 9
• # Players 6
Finance domain (57,623 pages)
Attribute |m|
• Stock Symbol 84
• Price Change 73
• % Change 73
• Volume 52
• Day Low 43
• Day High 41
• Last Price 29
• Open Price 24
Demo
• Found Websites
• Integrated Data
the end!
http://flint.dia.uniroma3.it
License
• This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.