Managing The Structured Web
Michael J. Cafarella University of Michigan
Michigan CSE, April 23, 2010
The Structured Web
Web pages contain structure that is obvious to humans, though not to machines
Search engines are largely blind to it
Databases need data that is perfectly structured
Different Approaches
Extraction Techniques
  Tables: WebTables [WebDB’08, VLDB’08]
  Large-scale entity extraction: Structurepedia [ongoing]
Applications
  Web data integration: Octopus [VLDB’09]
  Structure-aware Web search: Meez [ongoing]
Tools
  MapReduce Optimizer: Manimal [ongoing]
Progress in one reinforces others
Different Approaches
Extraction Techniques
  Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
  Large-scale entity extraction: Structurepedia [ongoing]
Applications
  Web data integration: Octopus [VLDB’09]
  Structure-aware Web search: Meez [ongoing]
Tools
  MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)
WebTables
WebTables system automatically extracts databases from web crawl
  [WebDB08, “Uncovering…”, Cafarella et al]
  [VLDB08, “WebTables: Exploring…”, Cafarella et al]
An extracted relation is one table plus labeled columns
Estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases
[Figure: pipeline diagram with stages labeled Raw crawled pages, Raw HTML Tables, Recovered Relations, Applications, and Schema Statistics]
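To make "one table plus labeled columns" concrete, here is a minimal sketch that turns a single well-formed HTML table into a relation. It is not the WebTables extractor: it assumes, naively, that the column labels are exactly the <th> cells and that the table is not nested, whereas a real crawl needs filtering to separate relational tables from layout tables. The names TableToRelation, extract_relation, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TableToRelation(HTMLParser):
    """Collect <th> cells as column labels and <td> rows as tuples.

    Assumption: one flat table, with labels given in <th> cells.
    """

    def __init__(self):
        super().__init__()
        self.labels = []   # column labels, in order
        self.rows = []     # data rows as tuples of strings
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            if tag == "th":
                self.labels.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row:
            # Only rows that contained <td> cells become tuples.
            self.rows.append(tuple(self._row))
            self._row = None


def extract_relation(html):
    parser = TableToRelation()
    parser.feed(html)
    return {"labels": parser.labels, "rows": parser.rows}


# Hypothetical sample input.
HTML = """<table>
<tr><th>make</th><th>model</th></tr>
<tr><td>Toyota</td><td>Camry</td></tr>
<tr><td>Honda</td><td>Civic</td></tr>
</table>"""

relation = extract_relation(HTML)
```

The resulting relation pairs the label list with the data tuples, which is the unit the later schema statistics are computed over.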
Schema Statistics
Schema stats useful for computing attribute probabilities
  p(“make”), p(“model”), p(“zipcode”)
  p(“make” | “model”), p(“make” | “zipcode”)
Allows many applications
  Schema “tab-complete”
  Synonym discovery
  Others
Progress in extraction technique enables new data applications
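The attribute probabilities above can be sketched as simple counts over a corpus of extracted schemas. This is an illustrative reading, not the talk's implementation: it assumes p(a) is the fraction of schemas containing attribute a, and p(a | b) is the fraction of schemas containing b that also contain a. The four-schema corpus and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy corpus: each schema is the set of column labels of one relation.
SCHEMAS = [
    {"make", "model", "year"},
    {"make", "model", "price"},
    {"name", "zipcode"},
    {"make", "zipcode"},
]


def schema_stats(schemas):
    """Count how many schemas contain each attribute and each ordered attribute pair."""
    single = Counter()
    pair = Counter()
    for schema in schemas:
        single.update(schema)
        pair.update(permutations(schema, 2))
    return single, pair


def p(attr, single, total):
    """Fraction of schemas that contain attr."""
    return single[attr] / total


def p_given(a, b, single, pair):
    """Fraction of schemas containing b that also contain a."""
    return pair[(a, b)] / single[b] if single[b] else 0.0


single, pair = schema_stats(SCHEMAS)
total = len(SCHEMAS)

p_make = p("make", single, total)
p_make_given_model = p_given("make", "model", single, pair)
p_make_given_zipcode = p_given("make", "zipcode", single, pair)
```

Under these assumptions, a schema "tab-complete" could rank candidate attributes by p(candidate | attributes typed so far), which is one way the statistics support the applications listed above.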
Manimal (ongoing)
MapReduce very popular for “big data”
  Easy for non-database programmers
  Parallelizable, but inefficient
RDBMSes challenging for “big data”
  Programming and admin relatively difficult
  When well-used, very efficient
Manimal is hybrid MapReduce/RDBMS execution system
  Static analysis to extract code semantics
  e.g., if(score > 5)… corresponds to a database selection
  Extractions enable RDBMS-style optimizations
Progress in extraction enables new data tools
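The static-analysis idea can be illustrated with a small sketch that inspects a map function's source and recovers the predicate of a simple `if` as a selection condition. This is not Manimal's analyzer; it is a toy built on Python's standard ast module, and it assumes the predicate has the form `variable op constant`. The map function, the emit call inside it, and find_selection are hypothetical.

```python
import ast

# Hypothetical user-written map function, supplied as source text.
# It is only parsed, never executed, so `emit` need not exist.
MAP_SOURCE = '''
def map_fn(key, score):
    if score > 5:
        emit(key, score)
'''

# Comparison operators this sketch recognizes.
OPS = {
    ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<",
    ast.LtE: "<=", ast.Eq: "==", ast.NotEq: "!=",
}


def find_selection(source):
    """Return (field, operator, constant) for a simple `if` predicate, or None."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.If):
            continue
        test = node.test
        if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
            continue
        op = OPS.get(type(test.ops[0]))
        left, right = test.left, test.comparators[0]
        if op and isinstance(left, ast.Name) and isinstance(right, ast.Constant):
            return (left.id, op, right.value)
    return None


selection = find_selection(MAP_SOURCE)
```

Once a predicate such as ("score", ">", 5) is known, an execution system could in principle apply RDBMS-style optimizations, for example using an index on score to skip records that cannot pass the filter instead of scanning every input record.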