Data Quality: the “other” Face of Big Data
Barna Saha, Divesh Srivastava (AT&T Labs-Research)
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
2
Big Data + Data Quality
¨ Big data: all about the V’s
– Size: huge volume of data from multiple sources
– Speed: dynamic data, collected and analyzed at high velocity
– Complexity: huge variety of data and sources
¨ Goal: to extract significant value from big data
¨ Key issue: data quality
– Raw data is often of questionable veracity
– How do we obtain high quality information?
3
Case Study: Big Data Quality [LDL+13]
¨ Study on two domains
– Belief of clean data
– Poor quality data can have big impact
4
        #Sources  Period   #Objects  #Local-attrs  #Global-attrs  Considered items
Stock   55        7/2011   1000*20   333           153            16000*20
Flight  38        12/2011  1200*31   43            15             7200*31
Case Study: Big Data Quality
¨ Is the data consistent?
– Tolerance to 1% value difference
5
Case Study: Big Data Quality
¨ Why such inconsistency?
– Semantic ambiguity
6
Yahoo! Finance: 52wk Range: 25.38-95.71; Day’s Range: 93.80-95.71
Nasdaq: 52 Wk: 25.38-93.72
Case Study: Big Data Quality
¨ Why such inconsistency?
– Unit errors
7
76,821,000
76.82B
Case Study: Big Data Quality
8
Case Study: Big Data Quality
¨ Why such inconsistency?
– Pure errors
9
FlightView: 6:15 PM, 9:40 PM
FlightAware: 6:15 PM, 8:33 PM
Orbitz: 6:22 PM, 9:54 PM
Case Study: Big Data Quality
¨ Why such inconsistency?
– Random sample of 20 data items + 5 items with the largest # of values
10
Case Study: Big Data Quality
11
¨ Copying between sources?
Case Study: Big Data Quality
¨ Copying on erroneous data?
12
Case Study: Lessons Learned
¨ Big data has considerable inconsistency
– Even in domains where poor quality data can have big impact
– Semantic ambiguity, out-of-date data, unexplainable errors
¨ Data sources often copy from each other
– Copying can happen on erroneous data, spreading poor quality data
13
Data Quality: By the Numbers
¨ Impact of poor data quality
– Erroneous data costs US businesses $600 billion/year [E02]
– In DW projects, data cleaning takes 30-80% of time and budget
– Data quality tools market is growing at 16% annually, way over the 7% average for other IT segments [G07]
¨ How much data is erroneous?
– Enterprise data error rates: average of 1-5%, some > 30% [R98]
– Only 1/3rd of XML Web documents with XSD/DTD are valid; 14% even lack well-formedness [GM11]
14
Small Data Quality: How Was It Achieved?
¨ Specify all domain knowledge as integrity constraints on data
– Reject updates that do not preserve integrity constraints
– Works well when the domain is well understood and static
15
Big Data Quality: A Different Approach?
¨ Big data: integrity constraints cannot be specified a priori
– Data diversity → complete domain knowledge is infeasible
– Data evolution → domain knowledge quickly becomes obsolete
– Too much rejected data → “small” data
16
Big Data Quality: A Different Approach?
¨ Big data: integrity constraints cannot be specified a priori
– Data diversity → complete domain knowledge is infeasible
– Data evolution → domain knowledge quickly becomes obsolete
¨ Solution: let the data speak for itself
– Learn models (semantics) from the data
– Identify data glitches as violations of the learned models
– Repair data glitches and models in a timely manner
17
In This Tutorial
¨ A focus on well-structured data and logic-based data quality
– Models: logical constraints, e.g., (C)FDs, IDs, MDs, EGDs, DCs
– Repairs: cost-based modifications to the data and models
¨ What we do not discuss in this tutorial
– Logic-based: consistent query answering, without data repairs
– Statistics-based: statistical models, anomaly detection
– Unstructured data: quality of audio, video
18
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
19
A Systematic Way to Data Quality
¨ Impose integrity constraints
¨ Errors and inconsistencies in the data emerge as violations of the constraints
Discovering/ Learning Data Quality Semantics
¨ “Small data”: manually specify rules that govern the data semantics
¨ “Big data”: let the data speak for itself
– Learn rules and patterns from the data
Discovering/ Learning Data Quality Semantics
¨ Variety of data
– Looking at condition and context
– Statistically robust measures
¨ Volume of data
– Scalable algorithms: efficiency vs. accuracy
¨ Velocity of data
– Streaming and incremental algorithms
An Instance of the Sales Relation
FD: [name, type, country] → [price, tax]
The functional dependency does not hold on this instance
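The check illustrated above can be sketched directly: group tuples on the LHS attributes and flag groups that disagree on the RHS. The relation and values below are hypothetical stand-ins for the slide's table, not its actual data.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Group tuples on LHS; a group with >1 distinct RHS violates the FD."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].add(tuple(row[a] for a in rhs))
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical Sales tuples for [name, type, country] -> [price, tax]
sales = [
    {"name": "iPad", "type": "tablet", "country": "UK", "price": 499, "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "UK", "price": 479, "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "US", "price": 499, "tax": 0.08},
]
viols = fd_violations(sales, ["name", "type", "country"], ["price", "tax"])
# The two UK tuples disagree on price, so the FD does not hold here.
```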
Conditional Functional Dependency
Full vs. Conditional
¨ A functional dependency specifies integrity constraints over the whole database
¨ High variety of data: one size does NOT fit all
– Conditional functional dependency
– Similarly: conditional inclusion dependency, conditional sequential dependency, conditional conservation dependency
An Instance of the Sales Relation
FD: [name, type, country] → [price, tax]
Consider the pattern [ -, -, UK || -, - ]
A pattern must have enough support, but it is OK to have small violations: these are possibly data errors
Local support = 7/20 = 0.35; local confidence = 6/7 ≈ 0.857
Global support = 15/20 = 0.75; global confidence = 13/15 ≈ 0.87
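A minimal sketch of these measures on a toy relation (not the 20-tuple slide example): support is the fraction of tuples matching the pattern, and confidence is the largest fraction of matching tuples on which the embedded FD can be made to hold.

```python
def cfd_stats(rows, lhs, rhs, pattern):
    """Local support and confidence of a CFD pattern.

    pattern maps each LHS attribute to a required constant, '-' is a
    wildcard. Confidence keeps, per LHS group, the most common RHS value.
    """
    from collections import Counter, defaultdict
    match = [r for r in rows if all(pattern[a] in ('-', r[a]) for a in lhs)]
    groups = defaultdict(Counter)
    for r in match:
        groups[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    support = len(match) / len(rows)
    confidence = kept / len(match) if match else 1.0
    return support, confidence

# Hypothetical data: three UK tuples, one with a deviating tax value.
sales = [
    {"name": "iPad", "type": "tablet", "country": "UK", "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "UK", "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "UK", "tax": 0.05},
    {"name": "iPad", "type": "tablet", "country": "US", "tax": 0.08},
]
sup, conf = cfd_stats(sales, ["name", "type", "country"], ["tax"],
                      {"name": "-", "type": "-", "country": "UK"})
```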
Exact vs. Soft/Approximate
¨ Exact approaches might lead to overfitting and a large number of patterns
– Open world assumption
¨ Notions of support and confidence give statistically robust measures
Learning Conditional Functional Dependencies (CFDs)
¨ Given an embedded FD, learn the pattern tableau
¨ Learn CFDs from scratch
– Learn the FD and also the patterns
¨ Learned CFDs should have enough support and confidence
Learning Pattern Tableaux [GKK+08]
¨ Generate the smallest tableau with given global support and global confidence
– NP-complete
– Hard to approximate
¨ Generate the smallest tableau with given global support and local confidence
– NP-complete
– APX-hard
– Approximable in tableau size (via greedy set cover)
Efficiency vs. Accuracy [GKK+08]
¨ Trade off running time against accuracy of the solution
– Learning pattern tableaux given an embedded FD X → Y
1. Consider all instantiations of X
2. Prune based on local confidence
3. Apply PARTIAL GREEDY COVERAGE until the desired support is reached
An Instance of the Sales Relation
FD: [name, type, country] → [price, tax]; pattern [ -, -, UK || -, - ]
(Figure: in the set-cover view, each candidate pattern is a SET and the tuples it covers are the ELEMENTS.)
Efficiency vs. Accuracy [GKK+08]
¨ All instantiations of X
– X = (A, B, C) with A = {a}, B = {b}, C = {c}
– All instantiations of X: {-, -, -}, {a, -, -}, {-, b, -}, {-, -, c}, {a, b, -}, {a, -, c}, {-, b, c}, {a, b, c}
– If |X| = K, then the number of patterns is 2^K
– Too many sets to consider in each iteration
Efficiency vs. Accuracy [GKK+08]
¨ Incremental generation of the search space
{-,-,-}
{a,-,-} {-,b,-} {-,-,c}
{a,b,-} {a,-,c} {-,b,c}
{a,b,c}
– Do not instantiate the entire search space of X
– Start from the top; if local confidence is not met, explore its children that are not already pruned
– If local confidence is met, remove the entire sub-lattice incident on it
¨ Same search space exploration as PARTIAL GREEDY SET COVER
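The lattice walk above can be sketched as follows (a toy setting with one constant per attribute, as in the X = (A, B, C) example; the subsequent partial greedy coverage step is not shown). `conf_of` stands in for whatever local-confidence computation is used.

```python
def explore_lattice(consts, conf_of, min_conf):
    """Top-down walk of the pattern lattice from the all-wildcard pattern.

    consts[i]: the constant available at position i. conf_of(p): local
    confidence of pattern p. A qualifying pattern is kept and its entire
    sub-lattice is pruned, mirroring the slide's rule.
    """
    def refines(p, q):  # p lies in the sub-lattice below q
        return all(qc == '-' or qc == pc for pc, qc in zip(p, q))

    keep, seen = [], set()
    stack = [tuple('-' for _ in consts)]
    while stack:
        p = stack.pop()
        if p in seen or any(refines(p, q) for q in keep):
            continue  # already visited, or inside a pruned sub-lattice
        seen.add(p)
        if conf_of(p) >= min_conf:
            keep.append(p)  # prune everything below p
        else:
            for i, v in enumerate(p):
                if v == '-':  # refine one wildcard to its constant
                    stack.append(p[:i] + (consts[i],) + p[i + 1:])
    return keep
```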
Streaming Validation of CFDs [CGK+09]
¨ Massive amounts of data arrive online
¨ Learn CFDs from sampled data, validate against the voluminous data
– Data does not fit in memory
– Create a concise summary of the data (fast)
Streaming Validation of CFDs [CGK+09]
¨ Simple summaries do not work
– Uniform sampling
– Uniform group sampling
(Figure: on the example tableau, estimated confidences of 0.75 and 0.625 diverge from the true confidence of 1.)
Streaming Validation of CFD [CGK+09]
¨ Given a relation R and an embedded FD X → Y, create a synopsis of the data so that, given any arbitrary CFD, we can return an estimate of its confidence with bounded error
¨ Approximation for efficiency
Streaming Validation of CFDs [CGK+09]
¨ Two-pass algorithm
– Sample (reservoir sampling) O(·) rows uniformly
– For each sampled row that satisfies the CFD on X:
 Sample (reservoir sampling) O(·) rows from its support and estimate confidence
 Alternate: maintain heavy hitters with space O(·)
– Return the average confidence
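The sampling primitive used in both passes is standard reservoir sampling, sketched below; the space bounds on the slide (the O(·) terms) depend on the error parameters and are not shown here.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """One-pass uniform sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)  # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```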
Streaming Validation of CFDs [CGK+09]
¨ Converting to a single pass
– Main idea:
 Classify groups based on exponentially decreasing support
 Keep a summary for the groups sampled at each level
– Estimate the support of each group
– Estimate the confidence of each group
– Combine these into the overall estimate
Learning CFDs from Scratch [FGLX+09]
¨ Classification of CFDs
– Constant CFDs: patterns contain only constants
– Variable CFDs: patterns may contain the wildcard “-”
– Learning constant CFDs is more efficient than variable CFDs
– Variable CFDs give more concise patterns
Learning CFDs from Scratch [FGLX+09]
¨ What kind of CFDs do we want to learn?
– Minimal CFDs: constant minimal or variable minimal
– Frequent CFDs: must have support over a threshold
Learning CFDs from Scratch [FGLX+09]
¨ Useful definitions
– Free itemset: no proper subset has the same support
– Closed itemset: no proper superset has the same support
1. If a constant CFD is minimal, then its LHS pattern is free and has the same support, so it is contained in a closed set
2. Also, there should not exist any smaller free set with the same property
Learning CFDs from Scratch [FGLX+09]
¨ CFDMiner (constant CFDs)
• Suppose we have all k-frequent closed itemsets and their corresponding k-frequent free itemsets at our disposal (GCGROWTH)
• Closed sets yield the only possible consequents [Property 1]; free sets yield minimal antecedents [Property 2]
• Return the corresponding CFD for each such pair
¨ Variable CFDs
• CTANE: extension of TANE for FDs; a level-wise algorithm that explores the attribute-set/pattern lattice
• FASTCFD: extension of FastFD for FDs; a depth-first search approach
Some Other Dependencies
¨ Inclusion
¨ Matching
¨ Sequential
¨ Conservation
¨ Denial
58
Inclusion Dependency
¨ Example: every manager is an employee
¨ Extension by condition and approximation
– Example: most persons in English DBpedia born in the 19th century and dying in the USA are also in German DBpedia
¨ Learning CINDs given an IND
59
Matching Dependency
• Generalization of entity resolution
• If two tuples show similar values in certain attributes, then a given attribute value of these tuples must be matched (made the same)
– Example: if names and phone numbers are sufficiently similar, make the addresses identical
Sequential Dependency
• Useful to express relationships between ordered attributes
• X → Y with gap range g: the difference between the Y-attribute values of any two consecutive records, when sorted on X, must lie in g
• Can identify missing data (gaps too large), extraneous data (gaps too small), and out-of-order data
• Extensions: approximate, conditional
• Creating pattern tableaux efficiently
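The basic check can be sketched as below: sort on X and flag consecutive Y-gaps outside the allowed range (the record schema here is hypothetical).

```python
def sd_violations(records, x, y, lo, hi):
    """Gaps of attribute y between consecutive records sorted on x
    that fall outside [lo, hi]. Gaps above hi suggest missing data;
    gaps below lo suggest extraneous or out-of-order data."""
    ordered = sorted(records, key=lambda r: r[x])
    return [(p[x], c[x], c[y] - p[y])
            for p, c in zip(ordered, ordered[1:])
            if not lo <= c[y] - p[y] <= hi]
```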
Conservation Dependency
• Useful to express relationships between two or more time series
– Example: total inflow over time must match total outflow over time
• Extensions: approximate, conditional
• Creating pattern tableaux efficiently
Learning Pattern Tableaux Efficiently [GKK+12]
• Conservation dependency: a quick flavor
• Extension with condition and approximation
– Example: total inflow over time must match total outflow over time
Conservation Dependency: Defining the Measure [GKK+12]
¨ Confidence of an interval: a simple violation count ignores the duration of violation
(Figure: incoming vs. outgoing traffic at a router; both series read 10, 8, 6, 4, 6 over sub-intervals a1..a5 and b1..b5.)
¨ Rightward matching between IN and OUT: travel minimally to the right to get matched
¨ A special case of EARTH MOVER DISTANCE
Conservation Dependency: Defining the Measure [GKK+12]
¨ Confidence = 1 − EMD / (maximum possible EMD)
– Matched IN/OUT series: EMD = 0, maximum possible EMD = 114, confidence = 1
– Mismatched series (one value displaced to 5): EMD = 114, maximum possible EMD = 114, confidence = 0
¨ How do we find all maximal intervals with high confidence efficiently?
Conservation Dependency [GKK+12]
• Key idea:
– Look at the cumulative curves (comes from EMD)
– Consider only a subset of intervals (for efficiency)
– Generate these subsets going backward from the n-th data point (to ensure a guaranteed approximation factor in near-linear time)
Efficiency VS Accuracy
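The cumulative-curve view can be sketched directly: for two series with equal totals, the one-dimensional EMD equals the L1 distance between their cumulative curves. How the slides compute the maximum possible EMD (114 in the example) is not recoverable here, so it is taken as a parameter.

```python
def emd_1d(inflow, outflow):
    """EMD between two equal-total series: L1 gap of the cumulative curves."""
    assert sum(inflow) == sum(outflow), "totals must be conserved"
    emd = cum = 0
    for a, b in zip(inflow, outflow):
        cum += a - b          # running surplus that must travel rightward
        emd += abs(cum)
    return emd

def conservation_confidence(inflow, outflow, max_emd):
    """Confidence = 1 - EMD / (maximum possible EMD), as on the slides."""
    return 1 - emd_1d(inflow, outflow) / max_emd
```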
Denial Constraints
• Universally quantified first-order logic
• Much more expressive than FDs and CFDs
• Examples:
a) If two persons live in the same state, then the one earning a lower salary has a lower tax rate
b) It is not possible to have a single tax exemption greater than the salary
• Useful for data repairing; discovery of denial constraints (with two attributes)
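Checking a denial constraint amounts to searching for a forbidden pair of tuples; a minimal sketch for example (a), on a hypothetical schema:

```python
from itertools import permutations

def dc_violations(rows, forbidden):
    """Ordered tuple pairs (i, j) for which the forbidden predicate holds."""
    return [(i, j) for (i, t1), (j, t2) in permutations(enumerate(rows), 2)
            if forbidden(t1, t2)]

# Example (a): same state, lower salary, but higher tax rate
def salary_tax(t1, t2):
    return (t1["state"] == t2["state"]
            and t1["salary"] < t2["salary"]
            and t1["tax"] > t2["tax"])
```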
Semi-structured Data
• Flexible representation
• Easy customization
• Error-prone
• The vast majority of XML documents on the Web do not have an accompanying DTD or XSD schema description
Learning DTD/XSD from an XML Corpus
• A good inference algorithm should satisfy:
1. Specialization: must minimally cover the given XML documents
2. Generalization: cover documents that are valid according to the “unknown” target schema but not present in the sample
Learning Document Type Definitions (DTDs)
• DTD: a context-free grammar with a regular expression (RE) on the RHS
• For every element name, infer the RE describing all the strings that appear below that element name in the XML corpus
• A seminal result by Gold: the class of all REs cannot be learned from positive examples only
• Which subsets of REs can be learned efficiently? [BNST+06]
• Class of SINGLE OCCURRENCE REs (SOREs)
– Every element name can appear only once, e.g., (a + b)? c is a SORE, but a (b a)* is not (a occurs twice)
• Class of CHAIN REGULAR EXPRESSIONS (CHAREs)
– Subset of SOREs: a chain of factors
– Experimentally performs better for generalization
Learning SOREs [BNST+06]
¨ SOREs are 2-testable
– A language is 2-testable when there are a set of start element names, a set of final element names, and a set of 2-grams such that a string belongs to the language iff its first symbol is a start element name, its last symbol is a final element name, and every 2-gram of the string is in the set
¨ Example
(Figure: an example automaton over the element names a, b, c.)
Learning SOREs [BNST+06]
¨ Given a set of strings, extract all the initial symbols, all the final symbols, and all the 2-grams; create the automaton
¨ Convert the automaton to an RE by rewriting
Rewrite rule DISJUNCTION: for a set of nodes that all have the same predecessor and successor sets:
(i) if they have no edges among themselves, merge them into a single node (e.g., a and b become a + b)
(ii) if they have all the edges among themselves, merge them into a single node and add a self-loop
(Figure: nodes a and b merge into (a + b), followed by c.)
Learning SOREs [BNST+06]
Rewrite rule SELF-LOOP: for a node r with a self-loop, delete the loop and relabel r as r+
(Figure: (a + b) with a self-loop becomes (a + b)+, followed by c.)
Learning SOREs [BNST+06]
Rewrite rule CONCATENATION: concatenate a chain of nodes into a single node
(Figure: (a + b)+ followed by c concatenates into (a + b)+ c.)
Learning SOREs [BNST+06]
Rewrite rule OPTIONAL: if all successors of r are also successors of r’s predecessors, relabel r as r? and remove all edges from r’s predecessors to r’s successors
¨ If the underlying DTD is indeed a SORE, the algorithm learns it
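The extraction step (before the rewriting to an RE, which is not shown) can be sketched as follows: learn the start symbols, final symbols, and 2-grams from positive examples, then test membership under 2-testability.

```python
def learn_2testable(samples):
    """Start symbols, final symbols, and 2-grams of the sample strings."""
    starts = {s[0] for s in samples}
    finals = {s[-1] for s in samples}
    grams = {(x, y) for s in samples for x, y in zip(s, s[1:])}
    return starts, finals, grams

def accepts(lang, word):
    """Membership in the learned 2-testable language: first symbol is a
    start, last symbol is a final, every 2-gram is allowed."""
    starts, finals, grams = lang
    return (bool(word) and word[0] in starts and word[-1] in finals
            and all(g in grams for g in zip(word, word[1:])))
```

Note how the learned language generalizes: training on "abc" and "ac" accepts exactly the strings built from the observed starts, finals, and 2-grams.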
Learning XSDs from an XML Corpus
¨ The content model of an element depends on its context
– Items in an order contain id and price
– Items in a stock contain id, quantity in stock and, depending on whether the item is atomic or composed, a list of sub-items
– A DTD does not distinguish between order items and stock items
¨ A single-occurrence XSD contains only single-occurrence regular expressions
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
83
Repair Techniques
¨ Glitch repairs by value modification, for FDs + InDs [BFF+05]
– Introduced the idea of cell equivalence classes
¨ Glitch + model repairs, for FDs [CM11]
– Introduced the idea of model repairs
¨ Glitch repairs, for EGDs [GMP+13]
– Introduced a chase-based technique to repair many constraints
84
Repairs Using Value Modification [BFF+05]
¨ Problem: given a database D and FD and InD constraints C such that (D, C) is inconsistent, find a repair D’ of D with minimum cost(D’)
¨ Result: the problem is NP-hard even for only FDs or only InDs
¨ Key ideas:
– Focus on value modifications of FD RHS attributes
– Cost model for repairs is based on value accuracy and repair similarity
– Equivalence classes of cells with identical values in the repair permit delayed assignment of a value to an equivalence class
86
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
87
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
88
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
89
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
90
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
Repairs Using Value Modification [BFF+05]
¨ Repair alternatives when records ti and tj violate FD: X → Y
¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X]
– Unclear what (different) value should be assigned to tj[X]
¨ Value modification of RHS attributes Y
– Modify tj[Y] to equal ti[Y], or vice versa
– Use cost of repair to choose between alternatives
– FD violations can always be repaired by modifying RHS attributes Y
– A naïve approach can lead to non-termination
91
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
92
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
93
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
94
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
95
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
?
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
 FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
96
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
Repairs Using Value Modification [BFF+05]
¨ Repair alternatives when record ti violates InD: Ri[X] → Rj[Y]
¨ Value modification of ti[X]
– Modify ti[X] to the value tj[Y] of some tj in Rj
¨ Value modification of tj[Y]
– Modify tj[Y] for some tj in Rj to equal ti[X]
¨ Use cost of repair to choose between alternatives
97
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
 FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
98
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8145 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ Greedily build equivalence classes of cells
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}
– {(t1, Name), (t4, Name)}
– …
99
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ Greedily build equivalence classes of cells, assign a unique value to each
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)} → 555-8145
– {(t1, Name), (t4, Name)} → Alice Smith
– …
100
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
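The equivalence classes of cells above can be maintained with a union-find structure; a minimal sketch (the (tuple, attribute) cell naming follows the slides, the value-assignment step is omitted):

```python
class CellClasses:
    """Union-find over (tuple_id, attribute) cells, as in [BFF+05]."""
    def __init__(self):
        self.parent = {}

    def find(self, cell):
        self.parent.setdefault(cell, cell)
        while self.parent[cell] != cell:
            self.parent[cell] = self.parent[self.parent[cell]]  # path halving
            cell = self.parent[cell]
        return cell

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Merging cells that an FD/InD repair forces to be equal:
cc = CellClasses()
cc.merge(("t2", "Tel"), ("t3", "Tel"))
cc.merge(("t5", "Tel"), ("t2", "Tel"))
```

A single value (e.g., 555-8145) can then be assigned to each class at the end, which is what makes the delayed assignment cheap.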
Repair Techniques
¨ Glitch repairs by value modification, for FDs + InDs [BFF+05]
– Introduced the idea of cell equivalence classes
¨ Glitch + model repairs, for FDs [CM11]
– Introduced the idea of model repairs
¨ Glitch repairs, for EGDs [GMP+13]
– Introduced a chase-based technique to repair many constraints
101
Repairing Data and Constraints [CM11]
¨ Motivation: evolution of data semantics
¨ Problem: given a database D and FD constraints C such that (D, C) is inconsistent, find a repair (D’, C’) with minimum cost
¨ Key ideas:
– Allow value modifications of FD RHS or LHS attributes
– Allow modifications of FDs in C by augmenting the LHS
– Cost model for repairs is based on minimum description length
102
Repairing Data and Constraints [CM11]
¨ FD: [District, Region] → [AC, City, State]
103
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
Repairing Data and Constraints [CM11]
¨ FD: [District, Region] → [AC, City, State]
– Expensive repair using only value modifications
104
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
Repairing Data and Constraints [CM11]
¨ Repair alternatives when records ti and tj violate FD: X → Y
¨ Value modification of RHS attributes Y
¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X], supported by the data
¨ Repair constraints by augmenting the LHS (X) with a new attribute
– The new attribute provides additional context
¨ Choose from the alternatives using an MDL-based cost model
105
MDL-Based Cost Model [CM11]
¨ Quantifies the trade-off of a data repair versus a constraint repair
¨ Cost model based on three properties
– Accuracy: value modifications must minimize distance
– Redundancy: value modifications must be well supported in the data; constraint repairs must result in a higher degree of consistency
– Conciseness: repaired constraints should explain, but not overfit
¨ Minimum description length (MDL) principle
– Length of the model + length to encode the data given the model
106
Repairing Data and Constraints [CM11]
¨ Cheap repair of constraints and data
– FD: [District, Region, Municipal] → [AC, City, State]
– t3.State = NY
107
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
EGD-Based Cleaning Framework [GMP+13]
¨ Many possible repairing strategies to obtain preferred values
– Using “master” data, e.g., table Src
– Using confidence and distance
– Using freshness and currency
¨ Issue: interaction between dependencies
– Sensitivity to the order in which repairs are applied
108
Validating XML
¨ Validate well-formedness first: strong validation
¨ Validate assuming well-formedness: validation
109
Validating XML
¨ How do we validate well-formedness in small space?
¨ What class of DTDs can be validated in small memory when the XML document streams in?
110
Validating Well-formedness in the Streaming Setting
¨ Streaming XML document
¨ Can we check if the document is well-formed in small space?
Well-formedness of XML Documents
¨ Open and close tags of XML documents must be well-formed
112
<article> <title>
A Relational Model for Large Shared Data Banks <authors> </title> <author>
<name>E. F. Codd
</name></author> </article>
Validating Well-formedness in the Streaming Setting [MMN10]
¨ Streaming XML document: can we check if it is well-formed in small space?
¨ Grammar of well-formed parentheses of s types
¨ If we can validate for 2 types, we can also validate for s types with a small blow-up in space
Validating Well-formedness in the Streaming Setting [MMN10]
¨ Validating for 2 types of parentheses
– Example with matching pairs of open and close parentheses
Validating Well-formedness in streaming setting [MMN10]
¨ Define two hash functions g and h over any subword, where p is a prime between n^(1+c) and n^(2(1+c)), and α, β are drawn uniformly from [0, p − 1]
¨ If v is well-formed then g(v) = h(v) = 0; otherwise, the probability that both are 0 is very low
Validating Well-formedness in the Streaming Setting [MMN10]
Algorithm (key idea)
¨ Read parentheses and reduce them to the form wW, where w consists of only down-steps and W consists of only up-steps
¨ If w is empty:
– construct hashes for W and compute its length: push (g(W), h(W), |W|) onto the stack
¨ Else:
– construct hashes for w, pop (g, h, l) from the stack, update g = g + g(w), h = h + h(w), l = l − 1, and push back onto the stack
– if l = 0 and g and h are not both identically 0: ERROR
– construct hashes for W along with its length and push onto the stack
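The invariant this hashing scheme certifies is ordinary balanced nesting of tags. A minimal non-streaming checker with an explicit stack (O(depth) space, rather than the sublinear space of [MMN10]), run on the slide's Codd-article example:

```python
import re

def well_formed(xml):
    """Check that open/close tags nest properly, using an explicit stack.
    Self-closing tags are skipped; text between tags is ignored."""
    stack = []
    for m in re.finditer(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>", xml):
        closing, name, selfclose = m.groups()
        if selfclose:
            continue
        if closing:
            if not stack or stack.pop() != name:
                return False        # close tag with no matching open tag
        else:
            stack.append(name)
    return not stack                # every open tag must have been closed
```

On the slide's example, `<authors>` is opened inside `<title>` but never closed before `</title>`, so the document is rejected.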
Repairing Malformedness Efficiently [KSSY13]
¨ Repairing based on edit distance
Repairing Malformedness [KSSY13]
¨ In the streaming setting, only very restricted errors can be repaired
¨ When there is sufficient memory to hold the entire XML document, near-linear-time algorithms can be devised with guaranteed performance
¨ Extension to consider the position of text
¨ Extension to return multiple edits using branch and bound
Open Problems
¨ Many learning problems are based on a lattice structure
– Exploit this structure better
– Example: CFD pattern tableau learning uses partial greedy set cover. Can we design a more careful algorithm that beats its approximation bound?
Open Problems
¨ Streaming and distributed settings, both for learning and detection, are extremely important
– Very basic results so far
– Data placement and replication become very useful for distributed processing
Open Problems
¨ Semistructured data
– What is the most general model that is tractable (validation + repair) in different computation models for XML?
– Learning distributions of types of errors: the Language Edit Distance problem
Open Problems
¨ Crowdsourcing
– Use the crowd to distinguish between data and error
– Extend crowd-based entity resolution techniques to handle matching dependencies
– Model errors made by the crowd themselves
?