Data Extraction and Integration from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre (Creative Commons License, see last slide)


Page 1: Data Extraction and Integration from Imprecise  Web Sources

Data Extraction and Integration from Imprecise Web Sources

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti

Università degli Studi Roma Tre

(Creative Commons License, see last slide)

Page 2: Data Extraction and Integration from Imprecise  Web Sources

Data-intensive websites

Page 3: Data Extraction and Integration from Imprecise  Web Sources

Website

Data-intensive websites

Database

Template1

Template2

Template3

target

Page 4: Data Extraction and Integration from Imprecise  Web Sources

Flint goal

…StockQuote

Last Min Max

Volume 52high Open

Page 5: Data Extraction and Integration from Imprecise  Web Sources

Flint

System architecture

WebSearch[WIDM08]

Data Extraction

Data Integration

The Web

Page 6: Data Extraction and Integration from Imprecise  Web Sources

Novel contribution

• Unsupervised
• Automatic
• Scalable
• No knowledge available

Data Extraction

RoadRunner [Vldb01], ExAlg [Sigmod03], TurboWrapper [Vldb07]

• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available

Data Integration

WebTables [Vldb08], Cimple [Vldb07], MetaQuerier [Cidr05], PayGo [Cidr07]

Page 7: Data Extraction and Integration from Imprecise  Web Sources

Data Extraction

Page 8: Data Extraction and Integration from Imprecise  Web Sources

Data Extraction

Page 9: Data Extraction and Integration from Imprecise  Web Sources

Data Extraction

AAPL, GOOG, MSFT, INTC, …

128.09, 439.54, 34.89, 112.37, …

127.81, 439.25, 32.13, 111.01, …

132.43, 443.82, 33.67, 114.32, …

0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …

Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio,

Add INTC to Your Portfolio, …

Page 10: Data Extraction and Integration from Imprecise  Web Sources

Data Extraction

HTML fragments taken from two pages belonging to the same website:

/html/body/table/tr[1]/td[2]: 1,132,228 , 1,735,857
/html/body/table/tr[2]/td[2]: $20.66 , $414.58
/html/body/table/tr[3]/td[2]: $11.70 , $247.30
/html/body/table/tr[4]/td[2]: $20.72 , $414.06
/html/body/table/tr[5]/td[2]: $0.02 , 99,494,200 (Extraction error!)
/html/body/table/tr[6]/td[2]: 4,732,600 , null (?)
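The failure mode on this slide can be reproduced with a small sketch. The two pages below are hypothetical: the row labels and the `$0.52` value are invented for illustration, and the second page simply carries one extra row. Python's `xml.etree.ElementTree` does not accept absolute paths on an element, so the rule `/html/body/table/tr[N]/td[2]` is written relative to the `<html>` root.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical pages from the same site. PAGE_B has one extra row,
# so every row after it is shifted down by one position.
PAGE_A = """<html><body><table>
<tr><td>Volume</td><td>1,132,228</td></tr>
<tr><td>Last</td><td>$20.66</td></tr>
<tr><td>Change</td><td>$0.02</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table>
<tr><td>Volume</td><td>1,735,857</td></tr>
<tr><td>Last</td><td>$414.58</td></tr>
<tr><td>Avg Volume</td><td>99,494,200</td></tr>
<tr><td>Change</td><td>$0.52</td></tr>
</table></body></html>"""


def extract(page: str, row: int) -> Optional[str]:
    """Apply the positional rule tr[row]/td[2] to one page."""
    root = ET.fromstring(page)  # root is the <html> element
    cell = root.find(f"./body/table/tr[{row}]/td[2]")
    return cell.text if cell is not None else None


for row in range(1, 5):
    print(row, extract(PAGE_A, row), extract(PAGE_B, row))
```

Rows 1 and 2 line up across the two pages, row 3 pairs a price change with a volume, and row 4 yields no value at all on the shorter page, which mirrors the mismatched and `null` pairs on the slide.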

Page 11: Data Extraction and Integration from Imprecise  Web Sources

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

Page 12: Data Extraction and Integration from Imprecise  Web Sources

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5 t=0.5

Page 13: Data Extraction and Integration from Imprecise  Web Sources

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5 t=0.5

1.0 1.0 1.0

Page 14: Data Extraction and Integration from Imprecise  Web Sources

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5 t=0.5

Page 15: Data Extraction and Integration from Imprecise  Web Sources

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

0.6 1.0 1.0
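The matching step on these slides, where a newly extracted column is accepted into a mapping only if its score reaches the threshold t, could be sketched as follows. The slides do not say how the scores are computed, so `similarity` below is a stand-in (one minus the mean relative difference) and will not reproduce the 0.6 shown above; `best_match` and the `MAPPINGS` dictionary are likewise hypothetical names.

```python
from statistics import mean


def similarity(a, b):
    """Stand-in score in [0, 1]: 1 minus the mean relative difference.

    This is an assumed measure, not the one used by the authors.
    """
    diffs = []
    for x, y in zip(a, b):
        scale = max(abs(x), abs(y))
        diffs.append(abs(x - y) / scale if scale else 0.0)
    return 1.0 - mean(diffs)


def best_match(column, mappings, threshold):
    """Return (name, score) of the best mapping scoring >= threshold, else None."""
    name, score = max(
        ((n, similarity(column, values)) for n, values in mappings.items()),
        key=lambda pair: pair[1],
    )
    return (name, score) if score >= threshold else None


# Columns already in the mappings, one value per stock (AA, GO, MS).
MAPPINGS = {"max": [10, 33, 16], "min": [4, 25, 10]}

print(best_match([10, 33, 16], MAPPINGS, threshold=0.5))
print(best_match([1000, 2000, 3000], MAPPINGS, threshold=0.5))
```

An identical column matches `max` with a score of 1.0, while a column on a completely different scale falls below the threshold and is left unmatched.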

Page 16: Data Extraction and Integration from Imprecise  Web Sources

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

?

1.0 1.0

Page 17: Data Extraction and Integration from Imprecise  Web Sources

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

1.0

Page 18: Data Extraction and Integration from Imprecise  Web Sources

t=0.7 t=0.7

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

1.0

Page 19: Data Extraction and Integration from Imprecise  Web Sources

t=0.7 t=0.7

Data Integration

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

Page 20: Data Extraction and Integration from Imprecise  Web Sources

t=0.7 t=0.7

Wrapper Refinement

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

min/max: 10, null, 10

? ?

0.3 (weak) 0.3 (weak) 0.0 0.0

Page 21: Data Extraction and Integration from Imprecise  Web Sources

Wrapper Refinement

matching value

nearby template tokens

//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
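A rule anchored on a nearby template token, such as `//td[contains(text(),'Open')]/../td[2]`, stays aligned even when rows shift. The sketch below emulates that rule in plain Python, because `xml.etree.ElementTree` supports neither `contains()` nor parent steps; the function name `value_next_to`, the page contents, and the labels are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical pages: PAGE_B has an extra row before "Change".
PAGE_A = """<html><body><table>
<tr><td>Volume</td><td>1,132,228</td></tr>
<tr><td>Change</td><td>$0.02</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table>
<tr><td>Volume</td><td>1,735,857</td></tr>
<tr><td>Avg Volume</td><td>99,494,200</td></tr>
<tr><td>Change</td><td>$0.52</td></tr>
</table></body></html>"""


def value_next_to(page: str, label: str) -> Optional[str]:
    """Emulate //td[contains(text(), label)]/../td[2] on one page.

    Returns the second cell of the first row in which some cell's
    text contains the label, or None when no such row exists.
    """
    root = ET.fromstring(page)
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and any(label in (c.text or "") for c in cells):
            return cells[1].text
    return None


print(value_next_to(PAGE_A, "Change"), value_next_to(PAGE_B, "Change"))
```

Because the rule follows the label rather than the row position, both pages return their own "Change" value despite the extra row. A short label such as "Volume" would also hit "Avg Volume" under `contains`, so the first matching row wins here.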

Page 22: Data Extraction and Integration from Imprecise  Web Sources

t=0.7 t=0.7

Wrapper Refinement

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

min/max: 10, null, 10

1.0 1.0

max: 10, 33, 16
min: 4, 25, 10

//td[contains(text(),'Max')]/../td[2]

//td[contains(text(),'Min')]/../td[2]
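One way to read the refinement step is as a search over candidate rules: each candidate is applied to the source's pages, and the rule whose extracted column scores best against an existing mapping is kept. The sketch below makes that concrete under loud assumptions. Pages are simplified to label/value dictionaries, each candidate rule is identified only by its anchor label, the "Open" values are invented, and `similarity` is a stand-in measure rather than the authors' scoring function.

```python
from statistics import mean


def similarity(a, b):
    """Stand-in score in [0, 1]; a missing value counts as a full mismatch."""
    diffs = []
    for x, y in zip(a, b):
        if x is None or y is None:
            diffs.append(1.0)
            continue
        scale = max(abs(x), abs(y))
        diffs.append(abs(x - y) / scale if scale else 0.0)
    return 1.0 - mean(diffs)


def refine(pages, candidate_labels, target):
    """Return (score, label) of the candidate rule that best matches target."""
    scored = []
    for label in candidate_labels:
        column = [page.get(label) for page in pages]  # apply the rule to every page
        scored.append((similarity(column, target), label))
    return max(scored)


# Hypothetical pages for the stocks AA, GO, MS.
PAGES = [
    {"Open": 7, "Max": 10, "Min": 4},
    {"Open": 30, "Max": 33, "Min": 25},
    {"Open": 11, "Max": 16, "Min": 10},
]

print(refine(PAGES, ["Open", "Max", "Min"], target=[10, 33, 16]))
print(refine(PAGES, ["Open", "Max", "Min"], target=[4, 25, 10]))
```

With these inputs the rule anchored on "Max" is selected for the `max` mapping and the rule anchored on "Min" for the `min` mapping, each with a score of 1.0, matching the two 1.0 scores shown on the slide.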

Page 23: Data Extraction and Integration from Imprecise  Web Sources

t=0.7 t=0.7

Wrapper Refinement

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

max: 10, 33, 16
min: 4, 25, 10
stock: AA, GO, MS

t=0.5 t=0.5

price: 6, 26, 12
min: 4, 25, 10
stock: AA, GO, MS

min/max: 10, null, 10

max: 10, 33, 16
min: 4, 25, 10

Page 24: Data Extraction and Integration from Imprecise  Web Sources

Experimental Results (100 websites for each domain)

Soccer domain (45,714 pages)

Attribute |m|

• Name 90
• Birth Date 61
• Height 54
• Nationality 48
• Club 43
• Position 43
• Weight 34
• League 14

Videogame domain (49,262 pages)

Attribute |m|

• Title 86
• Publisher 59
• Developer 45
• Genre 28
• ESRB rating 40
• Release Date 9
• Platform 9
• # Players 6

Finance domain (57,623 pages)

Attribute |m|

• Stock Symbol 84
• Price Change 73
• % Change 73
• Volume 52
• Day Low 43
• Day High 41
• Last Price 29
• Open Price 24

Page 26: Data Extraction and Integration from Imprecise  Web Sources

the end!

http://flint.dia.uniroma3.it

Page 27: Data Extraction and Integration from Imprecise  Web Sources

License

• This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.