Managing The Structured Web Michael J. Cafarella University of Michigan Michigan CSE April 23, 2010


Page 1: Managing The Structured Web

Managing The Structured Web

Michael J. Cafarella University of Michigan

Michigan CSE

April 23, 2010

Page 2: Managing The Structured Web

The Structured Web

  • Web pages contain structure that is obvious to humans, though not to machines
  • Search engines are largely blind to it
  • Databases need data that is perfectly structured

Page 3: Managing The Structured Web
Page 4: Managing The Structured Web

Different Approaches

  • Extraction Techniques
      • Tables: WebTables [WebDB’08, VLDB’08]
      • Large-scale entity extraction: Structurepedia [ongoing]
  • Applications
      • Web data integration: Octopus [VLDB’09]
      • Structure-aware Web search: Meez [ongoing]
  • Tools
      • MapReduce Optimizer: Manimal [ongoing]
  • Progress in one reinforces others

Page 5: Managing The Structured Web

Different Approaches

  • Extraction Techniques
      • Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
      • Large-scale entity extraction: Structurepedia [ongoing]
  • Applications
      • Web data integration: Octopus [VLDB’09]
      • Structure-aware Web search: Meez [ongoing]
  • Tools
      • MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)

Page 6: Managing The Structured Web


Page 7: Managing The Structured Web
Page 8: Managing The Structured Web

WebTables

  • The WebTables system automatically extracts databases from a web crawl
      [WebDB’08, “Uncovering…”, Cafarella et al.] [VLDB’08, “WebTables: Exploring…”, Cafarella et al.]
  • An extracted relation is one table plus labeled columns
  • We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases

[Figure: Raw crawled pages → Raw HTML tables → Recovered relations → Applications; Schema Statistics]
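The notion of "one table plus labeled columns" can be sketched in a few lines of Python. This is only an illustration, not the WebTables extractor: the `TableExtractor` class, the `to_relation` helper, and the rule "first row is the header, all rows have the same width" are assumptions made here, standing in for whatever filtering the real system applies. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of every <table> as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def to_relation(rows):
    """Naive stand-in filter: first row labels the columns, body must be rectangular."""
    if len(rows) < 2:
        return None
    header, body = rows[0], rows[1:]
    if not all(header) or any(len(r) != len(header) for r in body):
        return None
    return {"columns": header, "tuples": body}


# Toy input, invented for illustration only.
HTML = (
    "<table>"
    "<tr><th>make</th><th>model</th></tr>"
    "<tr><td>Toyota</td><td>Camry</td></tr>"
    "</table>"
)
parser = TableExtractor()
parser.feed(HTML)
relation = to_relation(parser.tables[0])
```

Here `relation` holds the column labels and the tuples of the single toy table; a table with ragged rows or empty header cells would be rejected by returning `None`.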

Page 9: Managing The Structured Web

Schema Statistics

  • Schema stats useful for computing attribute probabilities
      • p(“make”), p(“model”), p(“zipcode”)
      • p(“make” | “model”), p(“make” | “zipcode”)
  • Allows many applications
      • Schema “tab-complete”
      • Synonym discovery
      • Others
  • Progress in extraction technique enables new data applications
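A minimal sketch of how such probabilities could be estimated, assuming they are simple relative frequencies over a corpus of extracted schemas. The function names, the toy corpus, and the single-attribute ranking used for "tab-complete" are all assumptions made for illustration; the slides do not specify the estimator or the completion algorithm.

```python
from collections import Counter
from itertools import permutations


def schema_stats(schemas):
    """Count attribute and ordered attribute-pair occurrences over a list of schemas."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        attrs = set(schema)  # each attribute counts once per schema
        attr_counts.update(attrs)
        pair_counts.update(permutations(sorted(attrs), 2))
    return len(schemas), attr_counts, pair_counts


def p(attr, stats):
    """p(attr): fraction of schemas that contain attr."""
    total, attr_counts, _ = stats
    return attr_counts[attr] / total


def p_given(a, b, stats):
    """p(a | b): fraction of schemas containing b that also contain a."""
    _, attr_counts, pair_counts = stats
    if attr_counts[b] == 0:
        return 0.0
    return pair_counts[(a, b)] / attr_counts[b]


def tab_complete(attr, stats, k=3):
    """Hypothetical "tab-complete": rank other attributes by p(candidate | attr)."""
    _, attr_counts, _ = stats
    candidates = [a for a in attr_counts
                  if a != attr and p_given(a, attr, stats) > 0]
    candidates.sort(key=lambda a: (-p_given(a, attr, stats), a))
    return candidates[:k]


# Toy corpus, invented for illustration only.
TOY_SCHEMAS = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["name", "zipcode"],
    ["make", "zipcode"],
]
STATS = schema_stats(TOY_SCHEMAS)
```

On this toy corpus, `p("make", STATS)` is 0.75, `p_given("make", "model", STATS)` is 1.0, and `p_given("make", "zipcode", STATS)` is 0.5, which is why "make" is the first suggestion after "model".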

Page 10: Managing The Structured Web

Manimal (ongoing)

  • MapReduce very popular for “big data”
      • Easy for non-database programmers
      • Parallelizable, but inefficient
  • RDBMSes challenging for “big data”
      • Programming and admin relatively difficult
      • When well-used, very efficient
  • Manimal is a hybrid MapReduce/RDBMS execution system
      • Static analysis to extract code semantics
      • if(score > 5)… → database selection
      • Extractions enable RDBMS-style optimizations
  • Progress in extraction enables new data tools
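The idea of recovering a selection from user code can be illustrated with Python's standard `ast` module. This is a sketch of the general technique only: the slides do not say how Manimal performs its analysis, and `MAP_SOURCE`, `find_selection_predicates`, and `to_where_clause` are hypothetical names introduced here. The sketch recognizes only the simplest pattern, an `if` that compares a variable with a constant.

```python
import ast

# Hypothetical map function; `emit` is never called, the source is only parsed.
MAP_SOURCE = '''
def map_fn(key, score):
    if score > 5:
        emit(key, score)
'''

# Comparison operators the sketch knows how to translate.
OPERATORS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=", ast.Eq: "="}


def find_selection_predicates(source):
    """Return (field, operator, constant) for each `if NAME OP CONSTANT` test."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.If):
            continue
        test = node.test
        if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
            continue
        left, op, right = test.left, test.ops[0], test.comparators[0]
        if (isinstance(left, ast.Name)
                and isinstance(right, ast.Constant)
                and type(op) in OPERATORS):
            found.append((left.id, OPERATORS[type(op)], right.value))
    return found


def to_where_clause(predicates):
    """Render the extracted predicates as a SQL-style selection condition."""
    return " AND ".join(f"{field} {op} {value!r}" for field, op, value in predicates)


PREDICATES = find_selection_predicates(MAP_SOURCE)
WHERE = to_where_clause(PREDICATES)
```

Once the predicate `score > 5` is known, an execution system could in principle apply it as a selection before records reach the map function, for example by using an index on `score`; that is the kind of RDBMS-style optimization the slide refers to.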