Data Cleaning
Jacob Lurye
CS265
KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye SIGMOD Conference 2015
Source: BigDansing: A System for Big Data Cleansing. Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin SIGMOD Conference 2015
Pr(city = ‘LA’ | zipcode = 90210)
A      B          C         D        E          F         G
Rossi  Italy      Rome      Verona   Italian    Proto     1.78
Klate  S. Africa  Pretoria  Pirates  Afrikaans  P. Eliz.  1.69
Pirlo  Italy      Madrid    Juve     Italian    Flero     1.77
Integrity constraints?
Machine learning?
We need something more...
What is KATARA?
1. Table pattern definition and discovery (using KBs)
2. Table pattern validation via crowdsourcing
3. Data annotation
4. Repair recommendation
Resource Description Framework
Classes and Instances
Spielberg is an instance of class Director
E.T is an instance of class Sci-Fi Movie which is a subclass of Movie
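To make the class/instance distinction concrete, here is a minimal sketch that stores the two statements above as subject-predicate-object triples and follows subclass links. The in-memory set `TRIPLES` and the helpers `superclasses` and `is_instance_of` are illustrative assumptions, not part of any RDF library.

```python
# Hypothetical mini knowledge base: (subject, predicate, object) triples.
TRIPLES = {
    ("Spielberg", "rdf:type", "Director"),
    ("E.T", "rdf:type", "Sci-Fi Movie"),
    ("Sci-Fi Movie", "rdfs:subClassOf", "Movie"),
}


def superclasses(cls, triples):
    """Return cls together with every class reachable via rdfs:subClassOf."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(
            o for s, p, o in triples
            if s == current and p == "rdfs:subClassOf"
        )
    return seen


def is_instance_of(entity, cls, triples):
    """True if entity has a direct type that equals cls or is a subclass of it."""
    direct_types = {o for s, p, o in triples if s == entity and p == "rdf:type"}
    return any(cls in superclasses(t, triples) for t in direct_types)


print(is_instance_of("E.T", "Movie", TRIPLES))        # E.T is a Movie via Sci-Fi Movie
print(is_instance_of("Spielberg", "Movie", TRIPLES))  # Spielberg is a Director only
```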
KBs and table patterns: a few possibilities
Partial KB Coverage
“Does S. Africa hasCapital Pretoria?”
KBs and table patterns: a few possibilities
Not covered by the KB
“What are the possible relationships between Rossi and 1.78?”
SPARQL: a language for querying KBs
Q1: get relationships where both attributes are resources
Q2: get relationships with one resource and one literal
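The two query shapes can be sketched as Python helpers that build SPARQL strings. This is an illustration under assumptions: cell values are matched through `rdfs:label` with an English language tag, and the function names `q1_resource_resource` and `q2_resource_literal` are made up for this example. The exact queries used by KATARA may differ.

```python
RDFS_PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>"


def sparql_string(value, lang="en"):
    """Quote a cell value as a language-tagged SPARQL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def q1_resource_resource(value_i, value_j):
    """Q1 shape: both cell values are labels of KB resources."""
    return "\n".join([
        RDFS_PREFIX,
        "SELECT DISTINCT ?P WHERE {",
        f"  ?xi rdfs:label {sparql_string(value_i)} .",
        f"  ?xj rdfs:label {sparql_string(value_j)} .",
        "  ?xi ?P ?xj .",
        "}",
    ])


def q2_resource_literal(value_i, literal_term):
    """Q2 shape: one resource and one literal.

    literal_term must already be a valid SPARQL term, e.g. 1.78 or "abc".
    """
    return "\n".join([
        RDFS_PREFIX,
        "SELECT DISTINCT ?P WHERE {",
        f"  ?xi rdfs:label {sparql_string(value_i)} .",
        f"  ?xi ?P {literal_term} .",
        "}",
    ])


print(q1_resource_resource("Italy", "Rome"))
print(q2_resource_literal("Rossi", "1.78"))
```

The resulting strings could be sent to any SPARQL endpoint; how literals are typed in a given KB (for example, whether a height is stored as a decimal) is something to check against that KB.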
Evaluating possible column relationships
prob. of any entity appearing in the subject of property P
prob. of any entity being of type T
prob. of an entity being of type T and appearing as subject of P
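The three probabilities above are the ingredients of pointwise mutual information (PMI). The sketch below assumes the coherence between a type T and a property P is scored as normalized PMI rescaled to [0, 1]; the function name `coherence` and the counts in the usage lines are made up for illustration.

```python
import math


def coherence(n_entities, n_type, n_subject, n_both):
    """Rescaled normalized PMI between 'is of type T' and 'is a subject of P'.

    n_entities: total entities in the KB
    n_type:     entities of type T
    n_subject:  entities appearing as subject of P
    n_both:     entities of type T that also appear as subject of P
    """
    if n_both == 0:
        return 0.0  # never co-occur: NPMI tends to -1, which rescales to 0
    pr_t = n_type / n_entities
    pr_p = n_subject / n_entities
    pr_tp = n_both / n_entities
    if pr_tp == 1.0:
        return 1.0  # every entity has both; avoids dividing by -log(1) = 0
    pmi = math.log(pr_tp / (pr_t * pr_p))
    npmi = pmi / -math.log(pr_tp)
    return (npmi + 1) / 2


print(coherence(100, 50, 50, 25))  # independent events score near 0.5
print(coherence(100, 10, 10, 10))  # always co-occurring events score near 1
```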
Pattern validation
First, identify the variable that maximizes expected entropy reduction.
Query the crowd to validate that variable.
Remove the candidate patterns that violate the validated value, and repeat the steps above until one pattern is left.
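The validation loop can be sketched as follows. This is a simplified illustration, assuming each candidate pattern is a dictionary that assigns one value to every variable (a column type or a relationship), that pattern probabilities sum to 1, and that the crowd's answer is always one of the candidate values. The candidate patterns and their probabilities are invented.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_entropy_reduction(patterns, variable):
    """Prior entropy over patterns minus expected entropy after the answer."""
    prior = entropy(p for _, p in patterns)
    groups = defaultdict(list)
    for assignment, p in patterns:
        groups[assignment[variable]].append(p)
    expected_posterior = 0.0
    for probs in groups.values():
        mass = sum(probs)  # probability that the crowd gives this answer
        expected_posterior += mass * entropy(p / mass for p in probs)
    return prior - expected_posterior


def pick_variable(patterns):
    """Variable whose validation is expected to reduce uncertainty the most."""
    variables = patterns[0][0].keys()
    return max(variables, key=lambda v: expected_entropy_reduction(patterns, v))


def prune(patterns, variable, answer):
    """Keep patterns consistent with the crowd's answer and renormalize."""
    kept = [(a, p) for a, p in patterns if a[variable] == answer]
    total = sum(p for _, p in kept)
    return [(a, p / total) for a, p in kept] if kept else []


# Made-up candidate patterns for columns B and C of the example table.
patterns = [
    ({"B": "country", "C": "capital"}, 0.4),
    ({"B": "country", "C": "city"}, 0.3),
    ({"B": "economy", "C": "city"}, 0.2),
    ({"B": "state", "C": "capital"}, 0.1),
]
first = pick_variable(patterns)            # "B": its answer splits the mass most
patterns = prune(patterns, first, "country")
second = pick_variable(patterns)           # only "C" still distinguishes patterns
patterns = prune(patterns, second, "capital")
print(first, second, patterns)
```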
Recognizing erroneous tuples
If the KB fully covers the tuple, just execute a SPARQL query on it.
Otherwise, we need the crowd.
Recognizing erroneous tuples
Table pattern implies this, crowd says yes. Opportunity to enrich the KB.
Table pattern implies this, crowd says no.
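The annotation decision can be sketched as a small function. It assumes the table pattern has already been turned into the list of facts it implies for one tuple, that the KB is an in-memory set of triples, and that `ask_crowd` is a hypothetical callback returning True or False. The labels and example facts are illustrative.

```python
def annotate(tuple_facts, kb, ask_crowd):
    """Label one tuple given the facts the table pattern implies for it.

    Returns (label, crowd_confirmed_facts); the confirmed facts are
    candidates for enriching the KB.
    """
    missing = [fact for fact in tuple_facts if fact not in kb]
    if not missing:
        return "validated by KB", []
    confirmed = [fact for fact in missing if ask_crowd(fact)]
    if len(confirmed) == len(missing):
        return "validated by KB and crowd", confirmed
    return "erroneous", confirmed


# Made-up KB with partial coverage, and a stub standing in for the crowd.
kb = {("Italy", "hasCapital", "Rome")}
crowd_knowledge = {("S. Africa", "hasCapital", "Pretoria")}


def ask_crowd(fact):
    return fact in crowd_knowledge


print(annotate([("Italy", "hasCapital", "Rome")], kb, ask_crowd))
print(annotate([("S. Africa", "hasCapital", "Pretoria")], kb, ask_crowd))
print(annotate([("Italy", "hasCapital", "Madrid")], kb, ask_crowd))
```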
Setup
Data:
Wikitables: 28 tables, avg. 32 tuples
Webtables: 30 tables, avg. 60 tuples
RelationalTables: 3 tables
  Person: 317K tuples
  Soccer: 1625 tuples
  University: 1357 tuples
Algorithms:
(i) RankJoin (KATARA)
(ii) Support
(iii) MaxLike
(iv) PGM
Ground truth table patterns
Effectiveness – pattern matching
RankJoin requires the fewest top-k patterns to achieve a high F-measure.
Paper’s reason for fast convergence:
DBpedia: 865 types
Yago: 317K types
Efficiency – pattern matching
RankJoin outperforms MaxLike and PGM, and is nearly as fast as Support.
All are faster on DBpedia.
Effectiveness – crowdsourcing
10 students validate patterns, given 5 tuples per question.
Most improvement comes from the first question alone.
Effectiveness – crowdsourcing
Question order matters.
MUVF (most uncertain variable first)
AVF (all variables independent)
Effectiveness – repair suggestion
EQ: equivalence-class approach
SCARE: ML approach
Randomly generated errors?
Only RelationalTables?