WebTables: Exploring the Power of Tables on the Web Michael J. Cafarella, University of Washington (presently with University of Michigan) Alon Halevy, Google Daisy Zhe Wang, UC Berkeley Eugene Wu, MIT Yang Zhang, MIT Proceedings of VLDB '08, Auckland, New Zealand Presented by : Udit Joshi


WebTables: Exploring the Power of Tables on the Web

Michael J. Cafarella, University of Washington (presently with University of Michigan)

Alon Halevy, Google

Daisy Zhe Wang, UC Berkeley

Eugene Wu, MIT

Yang Zhang, MIT

Proceedings of VLDB '08, Auckland, New Zealand

Presented by : Udit Joshi

Introduction

• Web: a corpus of unstructured documents
• Relational data often encountered
• 14.1 billion HTML tables extracted by crawl
• Non-relational tables filtered out
• Corpus of 154M (1%) high quality relations
• Searching and Ranking
• Leveraging the statistical information

A typical use of the table tag to describe relational data

Contribution

• Ample user demand for structured data, visualisation
• Around 30 million queries from Google’s 1-day log
• Extracting a corpus of high quality relations (previous work)
• Determining effective Relation Ranking methods for search
• Analyzing and leveraging this corpus

Outline

Data Model
• Relation Extraction
• Attribute Correlation Statistics Database (ACSDb)

Relation Search
• Challenges
• Ranking Algorithms

ACSDb Applications
• Schema auto-complete
• Attribute synonym finding
• Join graph traversal

Experimental Results

Data Model

• Relation Extraction
• Attribute Correlation Statistics Database (ACSDb)

Relation Recovery

• Crawl based on <table> tag
• Filter out non-relational data

Relation extraction pipeline

Use of Table Tag to Describe Relational Data

Deep Web

• Tables behind HTML forms
• http://factfinder.census.gov/, http://www.cars.com/
• Most deep web data not crawlable
• Data in the Deep Web is huge
• Google’s Deep Web Crawl Project uses ‘Surfacing’
• Precomputes set of relevant form submissions
• Search query for “citibank atm 94043” returns a parameterized URL: http://locations.citibank.com/citibankV2/Index.aspx?zip=94022
• 40% of the corpus comes from deep web sources

Relational Recovery

• Two stages for extraction system:
  – Relational filtering (for “good” relations)
  – Metadata detection (in top row of table)
• HTML parser on a page crawl
• 14.1B instances of the <table> tag
• Script to disregard tables used for layout, forms, calendars, etc.

Relational Filtering

• Human judgment needed
• 2 independent judges given training data
• Scored from 1-5
• Qualifying score > 4

Relational Filtering

• Machine-learning classification problem
• Pair human classifications to a set of automatically extracted table features
• Forms a supervised training set for the statistical learner

Statistics to help distinguish relational tables (slide annotations: “> 1”; “less variation”)

Metadata Detection

• Only per-attribute labels needed
• Used in improving rank quality, data visualization, construction of ACSDb

Features to detect the header row in a table

Relation Extractor’s Performance (slide annotations: “high recall, low precision”; “equal weight”)


Attribute Correlation Statistics Database (ACSDb)

• Simple collection of statistics about schema attributes
• Derived from corpus of HTML tables
• Example entries: combo_make_model_year = 13, single_make = 3068
• Available as a single file for download
• 5.4M unique attribute names, 2.6M unique schemas

Source : http://www.eecs.umich.edu/~michjc/acsdb.html

Recovered Relations

name   | addr    | city    | state | zip
Dan S  | 16 Park | Seattle | WA    | 98195
Alon H | 129 Elm | Belmont | CA    | 94011

make   | model | year
Toyota | Camry | 1984

name       | size | last-modified
Readme.txt | 182  | Apr 26, 2005
cac.xml    | 813  | Jul 23, 2008

make     | model  | year | color
Chrysler | Volare | 1974 | yellow
Nissan   | Sentra | 1994 | red

make      | model   | year
Mazda     | Protégé | 2003
Chevrolet | Impala  | 1979

ACSDb

Schema                         | Freq
{make, model, year}            | 2
{name, size, last-modified}    | 1
{name, addr, city, state, zip} | 1
{make, model, year, color}     | 1

• ACSDb used for computing attribute probabilities
  – p(“make”) = 3/5
  – p(“zip”) = 1/5
  – p(“addr” | “name”) = 1/2
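The probabilities on this slide can be reproduced from the toy ACSDb shown with the recovered relations. The following is a minimal sketch, assuming the ACSDb is held as a mapping from schema (a set of attribute names) to frequency; the helper names `count`, `p` and `p_given` are illustrative and not taken from the paper.

```python
from fractions import Fraction

# Toy ACSDb copied from the slide: schema -> number of times it was seen.
ACSDB = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}


def count(acsdb, attrs):
    """Total frequency of the schemas that contain every attribute in attrs."""
    attrs = set(attrs)
    return sum(freq for schema, freq in acsdb.items() if attrs <= schema)


def p(acsdb, attrs):
    """Probability that a schema drawn from the ACSDb contains all of attrs."""
    return Fraction(count(acsdb, attrs), sum(acsdb.values()))


def p_given(acsdb, attrs, given):
    """Conditional probability p(attrs | given); undefined if given never occurs."""
    return Fraction(count(acsdb, set(attrs) | set(given)), count(acsdb, given))


print(p(ACSDB, {"make"}))                  # 3/5
print(p(ACSDB, {"zip"}))                   # 1/5
print(p_given(ACSDB, {"addr"}, {"name"}))  # 1/2
```

Exact fractions are used only so the output matches the slide; floating point would serve equally well at corpus scale.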

Structure of Corpus

• Corpus ℛ of databases
• Each database R ∈ ℛ is a single relation
• URL R_u and offset R_i within page define R
• Schema R_S is an ordered list of attributes, e.g. R_S = [Grand Prix, Date, Winning Driver, …]
• R_T is the list of tuples; size of each tuple t ≤ |R_S|

Extracting ACSDb from Corpus

Function createACS(ℛ)
  A = {}
  seenDomains = {}
  for all R ∈ ℛ
    if getDomain(R.u) ∉ seenDomains[R.S] then
      seenDomains[R.S].add(getDomain(R.u))
      A[R.S] = A[R.S] + 1
    end if
  end for
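A runnable sketch of the same idea: each schema is counted at most once per source domain, so one site publishing thousands of identical tables cannot dominate the statistics. The input shape (pairs of URL and schema), the use of the URL's network location as the "domain", and treating a schema as an unordered set are assumptions made for this illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def get_domain(url):
    """Assumed notion of 'domain': the network location of the URL."""
    return urlparse(url).netloc.lower()


def create_acs(relations):
    """relations: iterable of (url, schema) pairs; schema is a sequence of names.

    Returns a dict mapping each schema (as a frozenset) to the number of
    distinct domains it was seen on.
    """
    acs = defaultdict(int)
    seen_domains = defaultdict(set)
    for url, schema in relations:
        key = frozenset(schema)
        domain = get_domain(url)
        if domain not in seen_domains[key]:
            seen_domains[key].add(domain)
            acs[key] += 1
    return dict(acs)


corpus = [
    ("http://cars.example/a.html", ["make", "model", "year"]),
    ("http://cars.example/b.html", ["make", "model", "year"]),  # same domain, ignored
    ("http://autos.example/list", ["make", "model", "year"]),
    ("http://files.example/dir", ["name", "size", "last-modified"]),
]
acs = create_acs(corpus)
print(acs[frozenset({"make", "model", "year"})])  # 2
```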

Distribution of frequency-ordered unique schemas in ACSDb

Small number of schemas appear very frequently

Relational Search

• Challenges
• Ranking Algorithms

Relational Search

• Search engine style keyword based queries
• Query-appropriate visualizations
• Structured operations supported over search results
• Good search relevance is the key

Relational Search

Keyword query

Possible visualization

Ranked list of databases returned

Relation Ranking Challenges

• Relations are a mixture of “structure” and “content”

• Lack incoming hyperlink anchor text used in traditional IR

• PageRank style metrics unsuitable
• Inverted Index unsuitable

Relation Ranking Challenges

• No domain-specific schema graph
• Applying word frequency to embedded tables
• Factoring relation-specific features: schema elements, presence of keys, size of relation, # of NULLs


Naïve Rank

• Query q and top-k parameter as input
• Query sent to search engine
• Fetches top-k pages, extracts tables from each page
• Stops even if fewer than k tables are returned

Function naiveRank(q, k):
  let U = urls from web search for query q
  for i = 0 to k do
    emit getRelations(U[i])
  end for

Filter Rank

• Slight improvement
• Ensures k relations extracted

Function filterRank(q, k):
  let U = ranked urls from web search for query q
  let numEmitted = 0
  for all u ∈ U do
    for all r ∈ getRelations(u) do
      if numEmitted >= k then
        return
      end if
      emit r; numEmitted++
    end for
  end for
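The difference between the two baselines is easiest to see side by side. This is a sketch only: `urls` stands in for the ranked result list of a web search and `get_relations` for the table extractor, both hypothetical stand-ins supplied by the caller.

```python
from itertools import islice


def naive_rank(urls, get_relations, k):
    """Emit whatever relations sit on the top-k pages (possibly none)."""
    for url in urls[:k]:
        yield from get_relations(url)


def filter_rank(urls, get_relations, k):
    """Keep walking down the ranked pages until k relations have been emitted."""
    all_relations = (r for url in urls for r in get_relations(url))
    yield from islice(all_relations, k)


# Toy search result: the best-ranked page contains no relational table at all.
pages = {"u1": [], "u2": ["t1", "t2"], "u3": ["t3"], "u4": ["t4"]}
urls = ["u1", "u2", "u3", "u4"]

print(list(naive_rank(urls, pages.get, 1)))   # []
print(list(filter_rank(urls, pages.get, 1)))  # ['t1']
print(list(filter_rank(urls, pages.get, 3)))  # ['t1', 't2', 't3']
```

With k = 1 the naive variant returns nothing because the top page has no table, while the filtering variant moves on to the next page.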

Feature Rank

• No reliance on existing search engine
• Uses several features to score each extracted relation in the corpus
• Feature scores combined using Linear Regression Estimation (LRE)
• LRE trained on a thousand (q, relation) pairs
• Judged by two judges on a scale of 1-5
• Results sorted on score

Feature Rank

Function featureRank(q, k):
  let R = set of all relations extracted from corpus
  let score(r ∈ R) = combination of per-relation features
  sort r ∈ R by score(r)
  for i = 0 to k do
    emit R[i]
  end for

Query independent features:
• # rows, # cols
• has-header?
• # of NULLs in table

Query dependent features:
• document-search rank of source page
• # hits on header
• # hits on leftmost column
• # hits on second-to-leftmost column
• # hits on table body

Slide annotations: subject matter; semantic key
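A small sketch of how such features might be computed and combined linearly. The table representation (a dict with `header` and `rows`), the whole-word definition of a "hit", and the hand-picked weights are all assumptions for illustration; in the presented system the weights come from the trained regression, which is not reproduced here.

```python
def hits(terms, cells):
    """Count cells that contain at least one query term as a whole word."""
    return sum(1 for cell in cells if cell and terms & set(str(cell).lower().split()))


def features(query, table, page_rank):
    """Compute the query-independent and query-dependent features listed above."""
    terms = set(query.lower().split())
    header, rows = table["header"], table["rows"]
    body = [cell for row in rows for cell in row]
    return {
        "num_rows": len(rows),
        "num_cols": len(header),
        "has_header": 1.0 if header else 0.0,
        "num_nulls": sum(1 for cell in body if cell in ("", None)),
        "page_rank": page_rank,
        "header_hits": hits(terms, header),
        "leftmost_hits": hits(terms, [row[0] for row in rows if row]),
        "second_hits": hits(terms, [row[1] for row in rows if len(row) > 1]),
        "body_hits": hits(terms, body),
    }


def score(feats, weights, bias=0.0):
    """Linear combination of feature values; unknown features get weight 0."""
    return bias + sum(weights.get(name, 0.0) * value for name, value in feats.items())


def feature_rank(query, tables, weights, k):
    """tables: list of (table, page_rank) pairs. Returns the k best-scoring tables."""
    ranked = sorted(
        tables,
        key=lambda pair: score(features(query, pair[0], pair[1]), weights),
        reverse=True,
    )
    return [table for table, _ in ranked[:k]]


cities = {"header": ["city", "country"], "rows": [["Paris", "France"], ["Rome", "Italy"]]}
cars = {"header": ["make", "model"], "rows": [["Toyota", "Camry"]]}
weights = {"leftmost_hits": 2.0, "body_hits": 1.0}  # illustrative, not learned

top = feature_rank("paris", [(cars, 1), (cities, 2)], weights, k=1)
print(top[0]["header"])  # ['city', 'country']
```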

Schema Rank

• Uses ACSDb-based schema coherency score
• Coherent schema implies tighter relation
• High: {make, model}
• Low: {make, zipcode}
• Pointwise Mutual Information (PMI) determines how strongly two items are related
• Positive (strongly correlated), negative (negatively correlated), 0 (independent)
• Coherency score for schema S is the average of pairwise PMI scores over all pairs of attributes in the schema

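Schema Rank

• Coherency score: average pairwise PMI over the attributes of a schema
• Pointwise Mutual Information (PMI): 0 if independent, positive if correlated, negative if negatively correlated

Function cohere(R):
  totalPMI = 0
  for all a ∈ attrs(R), b ∈ attrs(R), a ≠ b do
    totalPMI = totalPMI + PMI(a, b)
  end for
  return totalPMI / (|R| ∗ (|R| − 1))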
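A sketch of the coherency score on the toy ACSDb from the data-model slides, using the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))). Two choices here are assumptions of this illustration: attribute pairs that never co-occur get a PMI of negative infinity (a real system would presumably smooth this), and the average runs over unordered pairs, which gives the same value as the ordered-pair loop above because PMI is symmetric.

```python
import math
from itertools import combinations

# Toy ACSDb: schema -> frequency.
ACSDB = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}
TOTAL = sum(ACSDB.values())


def p(attrs):
    """Fraction of schema occurrences that contain every attribute in attrs."""
    attrs = set(attrs)
    return sum(freq for schema, freq in ACSDB.items() if attrs <= schema) / TOTAL


def pmi(a, b):
    """Pointwise mutual information of two attributes."""
    joint = p({a, b})
    if joint == 0:
        return float("-inf")  # assumption: no smoothing for unseen pairs
    return math.log(joint / (p({a}) * p({b})))


def cohere(schema):
    """Average pairwise PMI over all attribute pairs of the schema."""
    pairs = list(combinations(sorted(set(schema)), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)


print(cohere(["make", "model"]))  # positive: the attributes co-occur
print(cohere(["make", "zip"]))    # -inf: they never co-occur in the toy data
```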

Indexing

• Inverted index (term -> docid, offset)
• WebTables data exists in two dimensions
• (term -> tableid, (x, y) offsets) better suited for ranking function
• Supports queries with spatial operators like samerow and samecol
• Example: Paris and France on same row; Paris, London and Madrid in same column
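The two-dimensional posting list can be sketched in a few lines. The in-memory layout and the function signatures below are assumptions made for illustration; only the idea of storing (tableid, x, y) per term and the operator names samerow and samecol come from the slide.

```python
from collections import defaultdict


def build_index(tables):
    """tables: dict table_id -> list of rows. Returns term -> [(table_id, x, y)]."""
    index = defaultdict(list)
    for table_id, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                for term in str(cell).lower().split():
                    index[term].append((table_id, x, y))
    return index


def samerow(index, term_a, term_b):
    """Tables in which both terms occur somewhere on the same row."""
    rows_a = {(t, y) for t, _, y in index.get(term_a.lower(), [])}
    rows_b = {(t, y) for t, _, y in index.get(term_b.lower(), [])}
    return sorted({t for t, _ in rows_a & rows_b})


def samecol(index, *terms):
    """Tables in which all terms occur in one and the same column."""
    cols = [{(t, x) for t, x, _ in index.get(term.lower(), [])} for term in terms]
    return sorted({t for t, _ in set.intersection(*cols)}) if cols else []


tables = {
    "t1": [["City", "Country"], ["Paris", "France"], ["London", "UK"], ["Madrid", "Spain"]],
    "t2": [["Paris", "Texas"], ["France", "Pass"]],
}
idx = build_index(tables)
print(samerow(idx, "Paris", "France"))            # ['t1']
print(samecol(idx, "Paris", "London", "Madrid"))  # ['t1']
```

A one-dimensional (docid, offset) index could not tell `t1` from `t2` for the first query, since both tables contain both words.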

Web Tables Search System

Index split across servers

ACSDb Applications

• Schema Auto Complete
• Attribute Synonym-Finding
• Join Graph Traversal

Schema Auto-Complete

• To assist novice database designers
• User enters one or more domain-specific attributes (example: “make”)
• System guesses suggestions appropriate to the target domain (example: “model”, “year”, “price”, “mileage”)

Schema Auto-Complete

• Maximize p(S-I | I)
• Probability values computed from ACSDb
• Add to S from overall attribute set A
• Threshold t set to .01
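One way to read these bullets is as a greedy loop: starting from the user's input I, repeatedly add the attribute that keeps p(S-I | I) highest, and stop once that probability would fall to the threshold or below. The sketch below follows that reading on the toy ACSDb; the exact stopping rule and the alphabetical tie-breaking are assumptions, not details confirmed by the slides.

```python
# Toy ACSDb: schema -> frequency.
ACSDB = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}


def count(acsdb, attrs):
    """Total frequency of the schemas that contain every attribute in attrs."""
    attrs = set(attrs)
    return sum(freq for schema, freq in acsdb.items() if attrs <= schema)


def suggest(acsdb, seed, t=0.01):
    """Greedily extend the seed attributes; returns suggestions in the order added."""
    seed = frozenset(seed)
    base = count(acsdb, seed)
    if base == 0:
        return []  # the seed never occurs, so nothing can be conditioned on it
    all_attrs = sorted(set().union(*acsdb))
    schema, added = set(seed), []
    while True:
        candidates = [a for a in all_attrs if a not in schema]
        if not candidates:
            break
        # p(S - I | I) after adding a is count(S + a) / count(I).
        best = max(candidates, key=lambda a: count(acsdb, schema | {a}))
        if count(acsdb, schema | {best}) / base <= t:
            break
        schema.add(best)
        added.append(best)
    return added


print(suggest(ACSDB, {"make"}))         # ['model', 'year', 'color']
print(suggest(ACSDB, {"make"}, t=0.5))  # ['model', 'year']
```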


Attribute Synonym-Finding

• Traditionally done using thesauri
• Thesauri do not support non-natural-language strings, e.g. tel-#
• Input: set of context attributes, C
• Output: list of attribute pairs P likely to be synonymous in schemas that contain C
• Example: for attribute “artist”, output is “song/track”

Attribute Synonym-Finding

• For synonymous attributes a, b: p(a, b) = 0
• If p(a, b) = 0 and p(a)p(b) is large, the syn score is high
• Synonyms appear in similar contexts C: for a third attribute z, z ∈ A, z ∉ C, p(z|a,C) ≈ p(z|b,C)
• The denominator of the syn score measures how much p(z|a,C) and p(z|b,C) differ over such z: if a, b always “replace” each other then denominator ≈ 0, else denominator is large

Attribute Synonym-Finding

Function SynFind(C, t):
  R = []
  A = all attributes that appear in ACSDb with C
  for a ∈ A, b ∈ A, s.t. a ≠ b do
    if (a, b) ∉ ACSDb then
      // Score candidate pair with syn function
      if syn(a, b) > t then
        R.append(a, b)
      end if
    end if
  end for
  sort R in descending syn order
  return R
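A runnable sketch of the synonym finder on a made-up ACSDb. The score used here is p(a)p(b) divided by a small constant plus the summed squared differences between p(z|a,C) and p(z|b,C); this form, the constant `eps`, the exclusion of a and b themselves from the sum, and the toy frequencies are all assumptions of this illustration rather than values taken from the slides.

```python
from itertools import combinations

# Hypothetical ACSDb in which "phone" and "tel-#" play the same role.
ACSDB = {
    frozenset({"name", "email", "phone"}): 4,
    frozenset({"name", "email", "tel-#"}): 3,
    frozenset({"name", "size", "last-modified"}): 2,
}


def count(acsdb, attrs):
    """Total frequency of the schemas that contain every attribute in attrs."""
    attrs = set(attrs)
    return sum(freq for schema, freq in acsdb.items() if attrs <= schema)


def syn_find(acsdb, context, t, eps=0.01):
    """Return (a, b, score) triples with score > t, best first."""
    context = frozenset(context)
    total = sum(acsdb.values())
    # A: attributes that appear together with the context C.
    attrs = sorted(set().union(*(s for s in acsdb if context <= s)) - context)

    def p_cond(z, a):
        denom = count(acsdb, context | {a})
        return count(acsdb, context | {a, z}) / denom if denom else 0.0

    results = []
    for a, b in combinations(attrs, 2):
        if count(acsdb, {a, b}) > 0:
            continue  # synonyms should never share a schema
        diff = sum((p_cond(z, a) - p_cond(z, b)) ** 2 for z in attrs if z not in (a, b))
        score = (count(acsdb, {a}) / total) * (count(acsdb, {b}) / total) / (eps + diff)
        if score > t:
            results.append((a, b, score))
    results.sort(key=lambda r: r[2], reverse=True)
    return results


for a, b, s in syn_find(ACSDB, {"name"}, t=1.0):
    print(a, b, round(s, 2))  # phone tel-# 14.81
```

Pairs such as ("email", "size") also never co-occur, but their contexts differ, so the denominator is large and the score stays below the threshold.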


Join Graph Traversal

• Assist a schema designer
• Join graph (N, L)
• Node for every unique schema, undirected join link between any 2 schemas sharing a label
• Join graph cluttered
• Cluster together similar schema neighbors

Join Neighbor Similarity

• Measure whether shared attribute D plays similar role in schema X and Y
• Similar to coherency score, except probability inputs to PMI fn conditioned on presence of D
• Two schemas that cohere well are clustered together
• Used as distance metric to cluster schemas sharing an attribute with S
• User can choose from fewer outgoing links

Join Graph Traversal

// input: ACSDb A, focal schema F
// output: Join Graph (N, L) connecting any two schemas with shared attributes

Function ConstructJoinGraph(A, F):
  N = {}
  L = {}
  // schema S, shared attribute c
  for (S, c) ∈ A do
    N.add(S)  // add node
  end for
  for (S, c) ∈ A do
    for attr ∈ F do
      if attr ∈ S then
        L.add((attr, F, S))  // add link
      end if
    end for
  end for
  return N, L
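The construction translates almost line for line. In this sketch the ACSDb is a mapping from schema to frequency, the focal schema is added to the node set even if it is absent from the ACSDb, and links from the focal schema to itself are skipped; the last two points are assumptions made to keep the toy graph tidy.

```python
# Toy ACSDb: schema -> frequency.
ACSDB = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}


def construct_join_graph(acsdb, focal):
    """Return (nodes, links); each link is (shared attribute, focal schema, other schema)."""
    focal = frozenset(focal)
    nodes = set(acsdb) | {focal}
    links = set()
    for schema in acsdb:
        if schema == focal:
            continue  # assumption: no self-links
        for attr in focal & schema:
            links.add((attr, focal, schema))
    return nodes, links


nodes, links = construct_join_graph(ACSDB, {"make", "model", "year"})
print(len(nodes))                           # 4
print(sorted(attr for attr, _, _ in links)) # ['make', 'model', 'year']
```

Even in this tiny example the focal schema gets three links to a single neighbor, which hints at why the full graph is cluttered and why neighbors are clustered before being shown.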

Experimental Results

Fraction of High Scoring Relevant Tables in Top-k

• Ranking: compared 4 algorithms on a test dataset, two judges
• Judges rate (query, relation) pairs from 1-5
• 1000 pairs over 30 queries
• Queries chosen by hand
• Fraction of top-k that are relevant (≥4) shows better performance at higher grain

(Percentages in parentheses: improvement over Naïve)

k  | Naïve | Filter     | Rank       | Rank-ACSDb
10 | 0.26  | 0.35 (35%) | 0.43 (65%) | 0.47 (81%)
20 | 0.33  | 0.47 (42%) | 0.56 (70%) | 0.59 (79%)
30 | 0.34  | 0.59 (74%) | 0.66 (94%) | 0.68 (100%)

Schema Auto-Completion

Baseball at-bats

File system contents

Rate of attribute recall for 10 expert generated test schemas

• Output schema almost always coherent
• Need to get most relevant attributes
• 6 humans created schema for each case
• Retained attributes ≥ 2 files sys -> address book
• 3 tries

Incremental improvement

Ambiguous data

Synonym Finding

Fraction of correct synonyms in top-k ranked list from the synonym finder

Judge determines accuracy

Join Neighbor Similarity

• Join Graph Traversal

Neighbor Schemas

Dataset generated from a workload of 10 focal schemas

Very few incorrect schema members

Future Scope

• Using tuple-keys as an analogue to attribute labels, create “data-suggest” feature
• Creating new data sets by integrating this corpus with user’s private data
• Expanding the WebTables search engine to incorporate a page quality metric like PageRank
• Including non-HTML tables, deep web databases and HTML Lists

Conclusion

• First large-scale attempt to extract relational info from corpus of HTML tables

• Created unique ACSDb statistics
• Showed utility of ACSDb
