Data Quality: the “other” Face of Big Data

Barna Saha, Divesh Srivastava
AT&T Labs-Research

Outline

• Introduction
• Discovering data quality semantics
• Repairing inconsistencies
• Open problems + Q/A

Big Data + Data Quality

• Big data: all about the V’s
  – Size: huge volume of data from multiple sources
  – Speed: dynamic data, collected and analyzed at high velocity
  – Complexity: huge variety of data and sources
• Goal: to extract significant value from big data
• Key issue: data quality
  – Raw data is often of questionable veracity
  – How do we obtain high quality information?

Case Study: Big Data Quality [LDL+13]

• Study on two domains
  – Belief of clean data
  – Poor quality data can have big impact

Domain  #Sources  Period   #Objects  #Local-attrs  #Global-attrs  Considered items
Stock   55        7/2011   1000*20   333           153            16000*20
Flight  38        12/2011  1200*31   43            15             7200*31

Case Study: Big Data Quality

• Is the data consistent?
  – Tolerance to 1% value difference

Case Study: Big Data Quality

• Why such inconsistency?
  – Semantic ambiguity

[Figure: screenshots from Yahoo! Finance and Nasdaq for the same stock, showing "52wk Range: 25.38-95.71", "52 Wk: 25.38-93.72", and "Day’s Range: 93.80-95.71"]

Case Study: Big Data Quality

• Why such inconsistency?
  – Unit errors

[Figure: the same value reported as 76,821,000 by one source and as 76.82B by another]


Case Study: Big Data Quality

• Why such inconsistency?
  – Pure errors

[Figure: the same flight as reported by FlightView, FlightAware, and Orbitz; times shown are 6:15 PM, 6:15 PM, 6:22 PM and 9:40 PM, 8:33 PM, 9:54 PM]

Case Study: Big Data Quality

• Why such inconsistency?
  – Random sample of 20 data items + 5 items with largest # of values

Case Study: Big Data Quality

• Copying between sources?

Case Study: Big Data Quality

• Copying on erroneous data?

Case Study: Lessons Learned

• Big data has considerable inconsistency
  – Even in domains where poor quality data can have big impact
  – Semantic ambiguity, out-of-date data, unexplainable errors
• Data sources often copy from each other
  – Copying can happen on erroneous data, spreading poor quality data

Data Quality: By the Numbers

• Impact of poor data quality
  – Erroneous data costs US businesses $600 billion/year [E02]
  – In DW projects, data cleaning takes 30-80% of time and budget
  – Data quality tools market is growing at 16% annually, well above the 7% average for other IT segments [G07]
• How much data is erroneous
  – Enterprise data error rates: average of 1-5%, some > 30% [R98]
  – Only 1/3rd of XML Web documents with XSD/DTD are valid, 14% even lack well-formedness [GM11]

Small Data Quality: How Was It Achieved?

• Specify all domain knowledge as integrity constraints on data
  – Reject updates that do not preserve integrity constraints
  – Works well when the domain is well understood and static

Big Data Quality: A Different Approach?

• Big data: integrity constraints cannot be specified a priori
  – Data diversity → complete domain knowledge is infeasible
  – Data evolution → domain knowledge quickly becomes obsolete
  – Too much rejected data → “small” data
• Solution: let the data speak for itself
  – Learn models (semantics) from the data
  – Identify data glitches as violations of the learned models
  – Repair data glitches and models in a timely manner

In This Tutorial

• A focus on well-structured data and logic-based data quality
  – Models: logical constraints, e.g., (C)FDs, IDs, MDs, EGDs, DCs
  – Repairs: cost-based modifications to the data and models
• What we do not discuss in this tutorial
  – Logic-based: consistent query answering, without data repairs
  – Statistics-based: statistical models, anomaly detection
  – Unstructured data: quality of audio, video

Outline

• Introduction
• Discovering data quality semantics
• Repairing inconsistencies
• Open problems + Q/A

A Systematic Way to Data Quality

• Impose integrity constraints
• Errors and inconsistencies in data emerge as violations of the constraints

Discovering/Learning Data Quality Semantics

• “Small data”: manually specify rules that govern the data semantics
• “Big data”
  – Let the data speak for itself
  – Learn rules and patterns from the data

Discovering/Learning Data Quality Semantics

• Variety of data
  – Looking at condition and context
  – Statistically robust measure
• Volume of data
  – Scalable algorithms
  – Efficiency vs. accuracy
• Velocity of data
  – Streaming and incremental algorithms

An Instance of Sales Relation

[name, type, country] → [price, tax]

• The functional dependency does not hold
• Conditional Functional Dependency

Full vs. Conditional

• A functional dependency specifies an integrity constraint over the whole database
• High variety of data → one size does NOT fit all
  – Conditional functional dependency
  – Similarly: conditional inclusion dependency, conditional sequential dependency, conditional conservation dependency

An Instance of Sales Relation

[name, type, country] → [price, tax]

Consider pattern [-, -, UK || -, -]

• A pattern must have enough support, but a small number of violations is acceptable; these are possibly data errors
• Local support = 7/20 = 0.35; local confidence = 6/7 = 0.857
• Global support = 15/20 = 0.75; global confidence = 13/15 = 0.87
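The support and confidence figures above can be illustrated with a minimal sketch. It assumes that local confidence is the largest fraction of the matching tuples that can be kept without violating the embedded FD, which is one common reading and may differ in detail from the tutorial's definition. The `sales` rows and all function names are hypothetical; the slide's 20-tuple instance is not reproduced here.

```python
from collections import Counter, defaultdict

WILDCARD = "-"

def matches(row, attrs, pattern):
    """True if the row agrees with every constant in the pattern ('-' matches anything)."""
    return all(p == WILDCARD or row[a] == p for a, p in zip(attrs, pattern))

def local_support_confidence(rows, lhs, rhs, lhs_pattern):
    """Support: fraction of rows matching the pattern on the LHS attributes.
    Confidence (assumed definition): fraction of the matching rows that remain
    after keeping only the most common RHS value in each LHS group."""
    matched = [r for r in rows if matches(r, lhs, lhs_pattern)]
    if not matched:
        return 0.0, 0.0
    groups = defaultdict(Counter)
    for r in matched:
        groups[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return len(matched) / len(rows), kept / len(matched)

# Hypothetical data: the third row disagrees on price with the first two.
sales = [
    {"name": "Harry Potter", "type": "book", "country": "UK", "price": 10, "tax": 0},
    {"name": "Harry Potter", "type": "book", "country": "UK", "price": 10, "tax": 0},
    {"name": "Harry Potter", "type": "book", "country": "UK", "price": 12, "tax": 0},
    {"name": "Harry Potter", "type": "book", "country": "France", "price": 10, "tax": 0.5},
    {"name": "Armani suit", "type": "clothing", "country": "UK", "price": 500, "tax": 10},
]
LHS, RHS = ["name", "type", "country"], ["price", "tax"]
support, confidence = local_support_confidence(sales, LHS, RHS, ["-", "-", "UK"])
```

On this toy instance four of the five rows match the UK pattern, and three of those four can be kept, so the pattern has support 0.8 and confidence 0.75.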

Exact vs. Soft/Approximate

• Exact approaches might lead to overfitting and a large number of patterns
  – Open world assumption
• Notion of support and confidence for statistically robust measures

Learning Conditional Functional Dependencies (CFD)

• Given an embedded FD, learn the pattern tableaux
• Learn CFD from scratch
  – Learn the FD and also the pattern
• Learnt CFD should have enough support and confidence

Learning Pattern Tableaux [GKK+08]

• Generate the smallest tableaux with given global support and global confidence
  – NP-complete
  – Hard to approximate
• Generate the smallest tableaux with given global support and local confidence
  – NP-complete
  – APX-hard
  – Approximation guarantee in tableaux size

Efficiency vs. Accuracy [GKK+08]

• Trade off running time against accuracy of the solution
  – Learning pattern tableaux given an embedded FD X → Y, with an approximation guarantee in tableaux size
      1. Consider all instantiations of X
      2. Prune based on local confidence
      3. Apply PARTIAL GREEDY COVERAGE until the desired support is reached
• Set cover view: each pattern, e.g., [-, -, UK || -, -], is a SET and the tuples it matches are its ELEMENTS
• All instantiations of X
  – X = (A, B, C), A = {a}, B = {b}, C = {c}
  – All instantiations of X: {-, -, -}, {a, -, -}, {-, b, -}, {-, -, c}, {a, b, -}, {a, -, c}, {-, b, c}, {a, b, c}
  – If |X| = K then the number of patterns is 2^K
  – Too many sets to consider in each iteration
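The enumeration and the greedy step can be sketched as follows. This is an illustration under simplifying assumptions, not the algorithm of [GKK+08]: the candidate patterns are assumed to have already passed the local-confidence pruning, and all names and data are hypothetical.

```python
from itertools import product

WILDCARD = "-"

def instantiations(values):
    """All 2^K patterns obtained by keeping or wildcarding each of the K values."""
    return [tuple(v if keep else WILDCARD for v, keep in zip(values, mask))
            for mask in product([False, True], repeat=len(values))]

def covers(pattern, values):
    """True if the pattern matches the tuple of values."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, values))

def partial_greedy_cover(tuples, candidates, target_support):
    """Repeatedly pick the candidate covering the most still-uncovered tuples
    until at least target_support of all tuples are covered."""
    if not candidates:
        return []
    uncovered = set(range(len(tuples)))
    needed = target_support * len(tuples)
    tableau = []
    while len(tuples) - len(uncovered) < needed:
        best = max(candidates,
                   key=lambda p: sum(covers(p, tuples[i]) for i in uncovered))
        gain = {i for i in uncovered if covers(best, tuples[i])}
        if not gain:  # no candidate covers anything new; give up
            break
        tableau.append(best)
        uncovered -= gain
    return tableau

patterns = instantiations(("a", "b", "c"))          # 8 patterns for K = 3
lhs_values = [("x", "UK"), ("y", "UK"), ("z", "FR"), ("z", "US")]
candidates = [("-", "UK"), ("z", "-"), ("x", "UK")]
tableau = partial_greedy_cover(lhs_values, candidates, target_support=0.75)
```

Each greedy iteration scans every candidate, which is why the 2^K growth in the number of patterns becomes the bottleneck.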

Efficiency vs. Accuracy [GKK+08]

• Incremental generation of the search space: do not instantiate the entire search space of X

             {-,-,-}
    {a,-,-}  {-,b,-}  {-,-,c}
    {a,b,-}  {a,-,c}  {-,b,c}
             {a,b,c}

• Start from the top pattern {-,-,-}; if local confidence is not met, explore its children that are not already pruned
• If local confidence is met, remove the entire sub-lattice incident on it
• Same search space exploration as PARTIAL GREEDY SET COVER
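A minimal sketch of the on-demand lattice walk, for the patterns generated from one tuple's values. It is a plain breadth-first traversal written for illustration; the `meets_confidence` callback and all names are hypothetical, and the published algorithm interleaves this exploration with the greedy cover rather than running it separately.

```python
from collections import deque

WILDCARD = "-"

def children(pattern, values):
    """Patterns one level down: instantiate exactly one more wildcard position."""
    return [pattern[:i] + (values[i],) + pattern[i + 1:]
            for i, p in enumerate(pattern) if p == WILDCARD]

def generalizes(general, specific):
    """True if `specific` lies in the sub-lattice below `general`."""
    return all(g == WILDCARD or g == s for g, s in zip(general, specific))

def explore(values, meets_confidence):
    """Top-down walk starting at the all-wildcard pattern. A pattern that meets
    the confidence threshold is kept and nothing below it is expanded."""
    root = (WILDCARD,) * len(values)
    queue, seen, accepted = deque([root]), {root}, []
    while queue:
        pattern = queue.popleft()
        if any(generalizes(a, pattern) for a in accepted):
            continue  # already inside a removed sub-lattice
        if meets_confidence(pattern):
            accepted.append(pattern)
            continue
        for child in children(pattern, values):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return accepted

# Hypothetical confidence test: only these two patterns are confident enough.
confident = {("a", "-", "-"), ("-", "b", "c")}
found = explore(("a", "b", "c"), lambda p: p in confident)
```

In this run the bottom pattern {a,b,c} is never generated, because every path to it passes through an accepted or removed pattern.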

Streaming Validation of CFD [CGK+09]

• Massive amounts of data arrive online
• Learn CFD from sampled data, validate against voluminous data
  – Data does not fit in memory
  – Create a concise summary of the data (fast)

Streaming Validation of CFD [CGK+09]

• Simple summaries do not work
  – Uniform sampling
  – Uniform group sampling

[Figure: two examples for a CFD with pattern (-, -); confidence values shown are 0.75 and 1 in the first example, 0.625 and 1 in the second]

Streaming Validation of CFD [CGK+09]

• Given a relation R and an embedded FD X → Y, create a synopsis of the data so that, given any arbitrary CFD, we can return an estimate of its confidence such that the estimation error is bounded
• Approximation for efficiency

Streaming Validation of CFD [CGK+09]

• Two-pass algorithm
  – Sample (reservoir sampling) O() rows uniformly
  – For each sampled row that satisfies the CFD on X
      – Sample (reservoir sampling) O() rows from its support and estimate confidence
      – Alternate: maintain heavy hitters with space O()
  – Return the average confidence
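Reservoir sampling, the building block used in both passes, keeps a uniform sample of fixed size from a stream whose length is not known in advance. A minimal sketch of the classic one-pass scheme follows; the sample size `k` is left to the caller, since the bounds on the slide are not recoverable.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown
    length, using O(k) memory and a single pass."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(n)       # uniform integer in [0, n)
            if j < k:                  # happens with probability k / n
                sample[j] = item
    return sample

rows = reservoir_sample(range(1000), 10, random.Random(0))
```

Passing a seeded `random.Random` instance makes a run repeatable; with the default argument the module-level generator is used.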

Streaming Validation of CFD [CGK+09]

• Converting to a single pass
  – Main idea
      – Classify groups based on exponentially decreasing support
      – Keep a summary for the groups sampled at each level
  – For each sampled group, estimate its support and its confidence
  – Combine the per-group estimates into an overall estimate

Learning CFD from Scratch [FGLX+09]

• Classification of CFD
  – Constant CFD: patterns contain only constants
  – Variable CFD: patterns may contain the wildcard “-”
• Learning constant CFDs is more efficient than learning variable CFDs
• Variable CFDs give more concise patterns

Learning CFD from Scratch [FGLX+09]

• What kind of CFD do we want to learn?
  – Minimal CFD
      – Constant minimal CFD
      – Variable minimal CFD
  – Frequent CFD: must have support over a threshold

Learning CFD from Scratch [FGLX+09]

• A useful definition
  – Free item set: an item set (X, t_p) such that no proper subset of it has the same support
  – Closed item set: an item set such that no proper superset of it has the same support
• Consequences for a minimal constant CFD (X → A, (t_p || a))
  1. (X, t_p) is free, and adding (A, a) to it does not change the support, so (A, a) is contained in the closure of (X, t_p)
  2. There should not exist any free (Y, s_p) properly contained in (X, t_p) whose closure also contains (A, a)

Learning CFD from Scratch [FGLX+09]

• CFD Miner
  – Suppose we have all k-frequent closed sets and their corresponding k-frequent free sets at our disposal (GCGROWTH)
  – For each free set, the candidate right-hand sides are the items of its closure that are not in the free set itself [Property 1: only possible consequent]
  – If there exists a smaller free set contained in it whose closure already contains a candidate, remove that candidate [Property 2]
  – Return, for each free set and each remaining candidate, the corresponding CFD
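The two properties can be illustrated with a brute-force sketch over a tiny relation. It uses the standard itemset definitions (free: no proper subset with the same support; closure: all items shared by every matching row) and is not the CFDMiner implementation, which relies on GCGROWTH instead of recomputing supports. All names and data are hypothetical.

```python
def support(rows, items):
    """Number of rows containing every (attribute, value) item."""
    return sum(1 for r in rows if all(r.get(a) == v for a, v in items))

def closure(rows, items):
    """All (attribute, value) items shared by every row that contains `items`."""
    matching = [r for r in rows if all(r.get(a) == v for a, v in items)]
    if not matching:
        return frozenset(items)
    common = set(matching[0].items())
    for r in matching[1:]:
        common &= set(r.items())
    return frozenset(common)

def is_free(rows, items):
    """Free: removing any single item changes the support. Checking subsets one
    item smaller is enough, because support can only grow as items are removed."""
    items = frozenset(items)
    s = support(rows, items)
    return all(support(rows, items - {i}) != s for i in items)

def minimal_consequents(rows, items):
    """Property 1: candidates are closure minus the set itself.
    Property 2: drop candidates already implied by a smaller subset."""
    items = frozenset(items)
    rhs = set(closure(rows, items) - items)
    for i in items:
        rhs -= closure(rows, items - {i})
    return rhs

rows = [
    {"country": "UK", "type": "book", "tax": 0},
    {"country": "UK", "type": "book", "tax": 0},
    {"country": "UK", "type": "toy", "tax": 20},
    {"country": "FR", "type": "book", "tax": 5},
]
lhs = {("country", "UK"), ("type", "book")}
consequents = minimal_consequents(rows, lhs)
```

Here the free set (country = UK, type = book) yields the single constant CFD with consequent tax = 0, since neither country = UK nor type = book alone determines the tax.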

Learning CFD from Scratch [FGLX+09]

• Variable CFD
  – CTANE: extension of TANE (for FDs); a level-wise algorithm that explores the attribute-set/pattern lattice
  – FASTCFD: extension of FASTFD (for FDs); a depth-first search approach

Some Other Dependencies

• Inclusion
• Matching
• Sequential
• Conservation
• Denial

Inclusion Dependency

• Example: every manager is an employee
• Extension by condition and approximation
  – Example: most persons in the English DBpedia who were born in the 19th century and died in the USA are also in the German DBpedia
• Learning CIND given IND
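A conditional, approximate inclusion dependency can be checked with a few lines. The sketch below is an assumption-laden illustration: confidence is taken to be the fraction of selected child rows whose value appears in the parent relation, and the function name, the `condition` callback, and the data are hypothetical.

```python
def ind_confidence(child_rows, child_attr, parent_rows, parent_attr,
                   condition=lambda row: True):
    """Fraction of the child rows selected by `condition` whose child_attr value
    also occurs as a parent_attr value. 1.0 means the (conditional) IND holds."""
    parent_values = {r[parent_attr] for r in parent_rows}
    selected = [r for r in child_rows if condition(r)]
    if not selected:
        return 1.0  # an IND over no rows holds vacuously
    return sum(r[child_attr] in parent_values for r in selected) / len(selected)

managers = [{"id": 1}, {"id": 2}, {"id": 9}]
employees = [{"id": 1}, {"id": 2}, {"id": 3}]

overall = ind_confidence(managers, "id", employees, "id")
restricted = ind_confidence(managers, "id", employees, "id",
                            condition=lambda r: r["id"] < 5)
```

The unconditional dependency holds for two of the three managers, while the version restricted by the condition holds exactly.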

Matching Dependency

• Generalization of entity resolution
• If two tuples have similar values in certain attributes, then a given attribute value of these tuples must be matched (made the same)
  – Example: if names and phone numbers are sufficiently similar, make the addresses identical
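The name/phone/address example can be sketched as a pairwise check. The similarity measure (the ratio from Python's `difflib.SequenceMatcher`) and the 0.9 threshold are assumptions chosen for illustration; a matching dependency can use any similarity operator. The records are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar(a, b, threshold):
    """True if the two strings are at least `threshold` similar (ratio in [0, 1])."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def md_violations(records, threshold=0.9):
    """Pairs whose names and phones are similar but whose addresses differ,
    i.e., pairs whose addresses the matching dependency says should be matched."""
    return [(r["id"], s["id"]) for r, s in combinations(records, 2)
            if similar(r["name"], s["name"], threshold)
            and similar(r["phone"], s["phone"], threshold)
            and r["address"] != s["address"]]

records = [
    {"id": 1, "name": "Alice Smith", "phone": "949-1212", "address": "17 Bridge"},
    {"id": 2, "name": "Alice Smith", "phone": "949-1212", "address": "27 Bridge"},
    {"id": 3, "name": "Bob Jones", "phone": "555-8145", "address": "5 Valley"},
]
to_match = md_violations(records)
```

The quadratic pair enumeration is only suitable for small inputs; at scale, blocking or indexing would be needed before comparing pairs.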

Sequential Dependency

• Useful to express relationships between ordered attributes
• X →_g Y: the difference between the Y-attribute values of any two consecutive records, when sorted on X, must be in the interval g
• Can identify missing data (gaps too large), extraneous data (gaps too small), out-of-order data
• Extension: approximate, conditional
• Creating pattern tableaux efficiently
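A direct check of a sequential dependency is short enough to sketch. The interval is passed as two bounds, `low` and `high`; the attribute names and the polling data are hypothetical.

```python
def sd_violations(rows, x, y, low, high):
    """Sort on x and report consecutive pairs whose difference in y falls
    outside the closed interval [low, high]."""
    ordered = sorted(rows, key=lambda r: r[x])
    return [(a[x], b[x]) for a, b in zip(ordered, ordered[1:])
            if not low <= b[y] - a[y] <= high]

# Hypothetical measurements expected roughly every 10 time units.
polls = [
    {"seq": 1, "time": 0},
    {"seq": 2, "time": 10},
    {"seq": 3, "time": 30},   # gap of 20: a poll is probably missing
    {"seq": 4, "time": 40},
]
gaps = sd_violations(polls, "seq", "time", low=9, high=11)
```

A gap above `high` suggests missing data, a gap below `low` suggests extraneous data, and a negative gap suggests out-of-order records.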

Conservation Dependency [GKK+12]

• Useful to express relationships between two or more time series
  – Example: total inflow over time must match total outflow over time
• Extension: approximate, conditional
• Learning pattern tableaux efficiently: a quick flavor

Conservation Dependency: Defining the Measure [GKK+12]

[Figure: incoming traffic (IN, a1 to a5) and outgoing traffic (OUT, b1 to b5) at a router; volumes per time step are 10, 8, 6, 4, 6 (total 34)]

• Confidence of an interval =
  – Ignores duration of violation
• Rightward matching between IN and OUT: travel minimally to the right to get matched
  – A special case of EARTH MOVER DISTANCE (EMD)
• Confidence = 1 − EMD / (maximum EMD possible)
  – EMD = 0, maximum EMD possible = 114: confidence = 1
  – EMD = 114, maximum EMD possible = 114: confidence = 0
• How do we find all maximal intervals with high confidence efficiently?
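The EMD-based confidence can be sketched from the cumulative curves. The sketch assumes one distance unit per time step and that cumulative outflow never exceeds cumulative inflow, so the EMD of the rightward matching equals the area between the two cumulative curves, and the maximum possible EMD equals the area under the cumulative IN curve. With the IN volumes from the figure that area is 10 + 18 + 24 + 28 + 34 = 114, which agrees with the maximum shown on the slide. The function name and the OUT series are hypothetical.

```python
from itertools import accumulate

def emd_confidence(incoming, outgoing):
    """1 - EMD / max EMD, where EMD is the area between the cumulative
    incoming and outgoing curves (assumes cumulative OUT <= cumulative IN)."""
    cum_in = list(accumulate(incoming))
    cum_out = list(accumulate(outgoing))
    emd = sum(a - b for a, b in zip(cum_in, cum_out))
    max_emd = sum(cum_in)
    return 1 - emd / max_emd if max_emd else 1.0

IN = [10, 8, 6, 4, 6]
perfect = emd_confidence(IN, IN)              # everything leaves immediately
nothing = emd_confidence(IN, [0, 0, 0, 0, 0])  # nothing leaves in the interval
delayed = emd_confidence(IN, [0, 10, 8, 6, 4])  # everything leaves one step late
```

Unlike a ratio of totals, this measure drops as traffic is held longer, so it reflects the duration of a violation.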

Conservation Dependency [GKK+12]

• Key idea
  – Look at the cumulative curves (comes from EMD)
  – Consider only a subset of intervals (for efficiency)
  – Generate these subsets going backward from the n-th data point (to ensure a guaranteed approximation factor in near-linear time)
• Efficiency vs. accuracy

Denial Constraints

• Universally quantified first-order logic
• Much more expressive than FD and CFD
• Examples:
  A. If two persons live in the same state, then the one earning a lower salary has a lower tax rate
  B. It is not possible to have a single tax exemption greater than the salary
• Useful for data repairing; discovery of denial constraints (with two attributes)
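Both example constraints can be checked by brute force, one over pairs of tuples and one over single tuples. In this sketch a pair violates example A only when the lower earner has a strictly higher tax rate, so equal rates are tolerated; that reading, the attribute names, and the data are assumptions made for illustration.

```python
from itertools import permutations

def violations_a(people):
    """Ordered pairs (p, q): same state, p earns less, yet p has a higher rate."""
    return [(p["name"], q["name"]) for p, q in permutations(people, 2)
            if p["state"] == q["state"]
            and p["salary"] < q["salary"]
            and p["rate"] > q["rate"]]

def violations_b(people):
    """People whose single tax exemption exceeds their salary."""
    return [p["name"] for p in people if p["exemption"] > p["salary"]]

people = [
    {"name": "A", "state": "NY", "salary": 50000, "rate": 5, "exemption": 1000},
    {"name": "B", "state": "NY", "salary": 60000, "rate": 3, "exemption": 2000},
    {"name": "C", "state": "CA", "salary": 40000, "rate": 2, "exemption": 50000},
]
pairs = violations_a(people)
singles = violations_b(people)
```

Example A compares two tuples while example B constrains a single tuple, which is the kind of expressiveness FDs and CFDs lack.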

Semi-structured Data

• Flexible representation
• Easy customization
• Error-prone
• The vast majority of XML documents on the Web do not have an accompanying DTD or XSD schema description

Learning DTD/XSD from XML corpus

• A good inference algorithm should satisfy
  1. Specialization: must minimally cover the given XML documents
  2. Generalization: must cover all documents that are valid according to the “unknown” target schema, even those not present in the sample

Learning Document Type Definitions (DTDs)

• DTD: context-free grammar with a regular expression (RE) on the right-hand side
• For every element name, infer the RE describing all the strings that appear below that element name in the XML corpus
• A seminal result by Gold: the class of all REs cannot be learned from positive examples only
• Which subset of REs can be learnt efficiently?

Learning Document Type Definitions (DTDs) [BNST+06]

• Which subset of REs can be learnt efficiently?
• Class of SINGLE OCCURRENCE REs (SORE)
  – Every element name can appear only once
• Class of CHAIN REGULAR EXPRESSIONS (CHARES)
  – Subset of SORE: chain of factors
  – Experimentally performs better for generalization

Learning SORE [BNST+06]

• SORE is 2-testable
  – A language L is 2-testable when there is a set of start element names I, a set of final element names F, and a set of 2-grams S such that w ∈ L iff the first symbol of w belongs to I, the last symbol of w belongs to F, and every 2-gram of w is in S
• Example

[Figure: example automaton over the element names a, b, c]

Learning SORE [BNST+06]

• Given a set of strings, extract all the initial symbols for I, all final symbols for F, and all 2-grams for S. Create the automaton.
• Convert the automaton to an RE by rewriting

Rewrite rules
  – DISJUNCTION: a set of nodes that all have the same predecessor set and the same successor set, and that either
      (i) have no edges among themselves: merge the nodes into a single node, or
      (ii) have all the edges among themselves: merge the nodes into a single node and add a self-loop
  – SELF-LOOP: delete the self-loop on r and relabel r by r^+
  – CONCATENATION: concatenate into a single node
  – OPTIONAL: all successors of r are also successors of the predecessors of r; relabel r by r? and remove all edges from r’s predecessors to r’s successors

[Figure: rewriting the example automaton step by step: nodes a, b, c; then (a+b) with a self-loop and c; then (a+b)^+ and c; then (a+b)^+ c]

• If the underlying DTD is indeed a SORE, the algorithm learns it
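The first step, building the 2-testable automaton from positive examples, can be sketched as below. The automaton is represented only by its three sets (initial symbols, final symbols, 2-grams); the rewriting into a regular expression is not covered. Function names and sample strings are hypothetical, and each string is a sequence of element names (single characters here for brevity).

```python
def learn_2testable(samples):
    """Collect the initial symbols, final symbols, and 2-grams of the samples."""
    initial, final, bigrams = set(), set(), set()
    for w in samples:
        if not w:
            continue
        initial.add(w[0])
        final.add(w[-1])
        bigrams.update(zip(w, w[1:]))
    return initial, final, bigrams

def accepts(model, w):
    """Membership test for the 2-testable language described by the three sets."""
    initial, final, bigrams = model
    return (bool(w) and w[0] in initial and w[-1] in final
            and all(g in bigrams for g in zip(w, w[1:])))

# Samples drawn from the language of (a+b)^+ c used in the rewriting example.
model = learn_2testable(["ac", "bc", "abc", "bac", "aac", "bbc"])
```

The learned model accepts unseen strings such as "abbac", which is the generalization step, while rejecting strings that start with c or end with a.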

Learning XSD from XML corpus

• The content model of an element depends on context
  – Items in an order contain id and price
  – Items in a stock contain id, quantity in stock and, depending on whether the item is atomic or composed, a list of sub-items
  – DTD does not distinguish between order items and stock items
• A single occurrence XSD contains only single occurrence regular expressions

Outline

• Introduction
• Discovering data quality semantics
• Repairing inconsistencies
• Open problems + Q/A

Repair Techniques

• Glitch repairs by value modification, for FDs + InDs [BFF+05]
  – Introduced the idea of cell equivalence classes
• Glitch + model repairs, for FDs [CM11]
  – Introduced the idea of model repairs
• Glitch repairs, for EGDs [GMP+13]
  – Introduced a chase-based technique to repair many constraints

Repairs Using Value Modification [BFF+05]

• Problem: given a database D and FD and InD constraints C such that (D, C) is inconsistent, find a repair D’ of D with minimum cost(D’)
• Result: the problem is NP-hard even for only FDs or only InDs
• Key ideas:
  – Focus on value modifications of FD RHS attributes
  – Cost model for repairs is based on value accuracy, repair similarity
  – Equivalence classes of cells with identical values in the repair permit a delayed assignment of a value to an equivalence class

Repairs Using Value Modification [BFF+05]

• InD: Equip[Tel] ⊆ Customer[Tel]
• FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
  – t1 and t4 agree on Tel but differ on Name and Street, so they violate the FD

CUSTOMER
TId  Tel       Name         Street     City      State  Zip    Wt
t1   949-1212  Alice Smith  17 Bridge  Midville  AZ     05211  2
t2   555-8145  Bob Jones    5 Valley   Centre    NY     10012  2
t3   555-8195  Bob Jones    5 Valley   Centre    NJ     10012  1
t4   949-1212  Ali Smith    27 Bridge  Midville  AZ     05211  1

EQUIP
TId  Tel       SerNo   EqMfct  EqModel  InstDate  Wt
t5   555-8145  L55001  LU      ze400    Jan-03    2
t6   555-8195  L55011  LU      ze400    Mar-03    1

Page 91: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ Repair alternatives when records ti and tj violate FD: X → Y

¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X]
– Unclear what (different) value should be assigned to tj[X]

¨ Value modification of RHS attributes Y
– Modify tj[Y] to equal ti[Y] or vice versa
– Use cost of repair to choose between alternatives
– FD violations can always be repaired by modifying RHS attributes Y
– Naïve approach can lead to non-termination
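The RHS-modification alternative can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of [BFF+05]: it assumes the Wt column is a per-record confidence weight, and it uses a hypothetical rule that copies the RHS of the highest-weight record within each violating group. The names `fd_violations` and `repair_rhs` are made up for this sketch.

```python
from collections import defaultdict

# CUSTOMER rows as on the slides; "Wt" is assumed to be a confidence weight.
CUSTOMER = [
    {"TId": "t1", "Tel": "949-1212", "Name": "Alice Smith", "Street": "17 Bridge",
     "City": "Midville", "State": "AZ", "Zip": "05211", "Wt": 2},
    {"TId": "t2", "Tel": "555-8145", "Name": "Bob Jones", "Street": "5 Valley",
     "City": "Centre", "State": "NY", "Zip": "10012", "Wt": 2},
    {"TId": "t3", "Tel": "555-8195", "Name": "Bob Jones", "Street": "5 Valley",
     "City": "Centre", "State": "NJ", "Zip": "10012", "Wt": 1},
    {"TId": "t4", "Tel": "949-1212", "Name": "Ali Smith", "Street": "27 Bridge",
     "City": "Midville", "State": "AZ", "Zip": "05211", "Wt": 1},
]


def fd_violations(rows, lhs, rhs):
    """Return the groups of rows that agree on lhs but disagree on rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    return [g for g in groups.values()
            if len({tuple(r[a] for a in rhs) for r in g}) > 1]


def repair_rhs(rows, lhs, rhs):
    """Repair by RHS modification (hypothetical rule: highest weight wins)."""
    repaired = [dict(r) for r in rows]  # leave the input untouched
    for group in fd_violations(repaired, lhs, rhs):
        winner = max(group, key=lambda r: r["Wt"])
        for row in group:
            for attr in rhs:
                row[attr] = winner[attr]
    return repaired


LHS = ["Tel"]
RHS = ["Name", "Street", "City", "State", "Zip"]
fixed = repair_rhs(CUSTOMER, LHS, RHS)
```

With the single FD Customer[Tel] → Customer[Name, Street, City, State, Zip], only t1 and t4 conflict, and the sketch overwrites t4 with t1's values. It handles one FD at a time; with several interacting FDs, repeating such a step naively is where the non-termination risk noted above arises.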


Page 92: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]


CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1

t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8195 L55011 LU ze400 Mar-03 1

Page 93: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]

CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1

t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8195 L55011 LU ze400 Mar-03 1

Page 94: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]

CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1

t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8195 L55011 LU ze400 Mar-03 1

X

Page 95: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]

CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1

t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8195 L55011 LU ze400 Mar-03 1

?

Page 96: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ InD: Equip[Tel] → Customer[Tel]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]

CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1

t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8195 L55011 LU ze400 Mar-03 1

X

Page 97: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ Repair alternatives when record ti violates InD: Ri[X] → Rj[Y]

¨ Value modification of ti[X]
– Modify ti[X] to a value tj[Y] for some tj in Rj

¨ Value modification of tj[Y]
– Modify tj[Y] for some tj in Rj to equal ti[X]

¨ Use cost of repair to choose between alternatives
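As a rough illustration of the first alternative (modifying ti[X]), the sketch below finds the Equip records whose Tel does not appear in Customer and picks the closest existing Customer value. It starts from the table state after the FD repairs, where t3.Tel has become 555-8145. The cost used here (record weight times number of differing characters) is an assumption made for the example, and `ind_violations`, `char_distance`, and `cheapest_lhs_repair` are hypothetical names.

```python
# Tel values in CUSTOMER after the FD repairs (t3.Tel now equals t2.Tel).
CUSTOMER_TEL = {"t1": "949-1212", "t2": "555-8145",
                "t3": "555-8145", "t4": "949-1212"}
EQUIP = [
    {"Tid": "t5", "Tel": "555-8145", "Wt": 2},
    {"Tid": "t6", "Tel": "555-8195", "Wt": 1},
]


def ind_violations(child_rows, attr, parent_values):
    """Rows whose attr value is missing from the referenced column."""
    return [r for r in child_rows if r[attr] not in parent_values]


def char_distance(a, b):
    """Assumed distance: differing positions plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def cheapest_lhs_repair(row, attr, parent_values):
    """Pick the referenced value closest to row[attr]; cost = weight * distance."""
    best = min(sorted(parent_values),
               key=lambda v: char_distance(row[attr], v))
    return best, row["Wt"] * char_distance(row[attr], best)


parent_values = set(CUSTOMER_TEL.values())
violating = ind_violations(EQUIP, "Tel", parent_values)
repairs = {r["Tid"]: cheapest_lhs_repair(r, "Tel", parent_values)
           for r in violating}
```

Here only t6 violates the InD, and the cheapest candidate under the assumed cost is 555-8145. The second alternative (changing some Customer tuple's Tel to 555-8195) would need to be costed the same way and compared, and it may reintroduce FD violations.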


Page 98: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ InD: Equip[Tel] → Customer[Tel]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
¨ FD: Customer[Zip] → Customer[City, State]
¨ FD: Customer[Name, Street, Zip] → Customer[Tel]

CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1

t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8145 L55011 LU ze400 Mar-03 1

Page 99: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ Greedily build equivalence classes of cells
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}
– {(t1, Name), (t4, Name)}
– …

CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1

t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8195 L55011 LU ze400 Mar-03 1

Page 100: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairs Using Value Modification [BFF+05]

¨ Greedily build equivalence classes of cells, assign unique value
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)} → 555-8145
– {(t1, Name), (t4, Name)} → Alice Smith
– …

CUSTOMER

TId Tel Name Street City State Zip Wt

t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2

t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2

t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1

t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1

EQUIP

Tid Tel SerNo EqMfct EqModel InstDate Wt

t5 555-8145 L55001 LU ze400 Jan-03 2

t6 555-8195 L55011 LU ze400 Mar-03 1
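The equivalence-class idea can be sketched with a union-find structure over cells (tuple id, attribute). This is only an illustration under stated assumptions: the list of merges is written out by hand rather than derived from the constraints, and the value of each class is chosen by a weighted vote using the Wt column, which is a simplification and not the cost function of [BFF+05]. `CellClasses` and `pick_value` are hypothetical names.

```python
from collections import defaultdict


class CellClasses:
    """Union-find over cells, where a cell is a (tuple id, attribute) pair."""

    def __init__(self):
        self.parent = {}

    def find(self, cell):
        self.parent.setdefault(cell, cell)
        while self.parent[cell] != cell:
            self.parent[cell] = self.parent[self.parent[cell]]  # path halving
            cell = self.parent[cell]
        return cell

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def classes(self):
        groups = defaultdict(set)
        for cell in list(self.parent):
            groups[self.find(cell)].add(cell)
        return list(groups.values())


def pick_value(cells, value_of, weight_of):
    """Assumed rule: the value with the largest total record weight wins."""
    totals = defaultdict(int)
    for cell in cells:
        totals[value_of[cell]] += weight_of[cell[0]]
    return max(sorted(totals), key=totals.get)


value_of = {
    ("t2", "Tel"): "555-8145", ("t3", "Tel"): "555-8195",
    ("t5", "Tel"): "555-8145", ("t6", "Tel"): "555-8195",
    ("t1", "Name"): "Alice Smith", ("t4", "Name"): "Ali Smith",
}
weight_of = {"t1": 2, "t2": 2, "t3": 1, "t4": 1, "t5": 2, "t6": 1}

# Hand-written merges standing in for those triggered by the FDs and the InD.
merges = [
    (("t2", "Tel"), ("t3", "Tel")),
    (("t5", "Tel"), ("t2", "Tel")),
    (("t6", "Tel"), ("t3", "Tel")),
    (("t1", "Name"), ("t4", "Name")),
]

cc = CellClasses()
for a, b in merges:
    cc.union(a, b)
assigned = {frozenset(c): pick_value(c, value_of, weight_of)
            for c in cc.classes()}
```

Under these assumptions the Tel class receives 555-8145 and the Name class receives Alice Smith, matching the assignments listed on the slide. Grouping cells first and assigning one value per class avoids the oscillation that repairing one violation at a time can cause.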

Page 101: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repair Techniques

¨ Glitch repairs by value modification, for FDs + InDs [BFF+05]
– Introduced the idea of cell equivalence classes

¨ Glitch + model repairs, for FDs [CM11]
– Introduced the idea of model repairs

¨ Glitch repairs, for EGDs [GMP+13]
– Introduced a chase-based technique to repair many constraints

Page 102: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairing Data and Constraints [CM11]

¨ Motivation: evolution of data semantics

¨ Problem: Given a database D and FD constraints C such that (D, C) is inconsistent, find a repair (D’, C’) with minimum cost

¨ Key ideas:
– Allow value modifications of FD RHS or LHS attributes
– Allow modifications of FDs in C by augmenting the LHS
– Cost model for repairs is based on minimum description length

Page 103: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairing Data and Constraints [CM11]

¨ FD: [District, Region] → [AC, City, State]


Tid District Region Municipal AC Tel Street Zip City State

t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY

t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY

t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA

t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA

t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA

t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL

t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL

t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL

t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL

Page 104: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairing Data and Constraints [CM11]

¨ FD: [District, Region] → [AC, City, State]
– Expensive repair using only value modifications

Tid District Region Municipal AC Tel Street Zip City State

t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY

t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY

t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA

t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA

t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA

t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL

t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL

t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL

t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL

Page 105: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairing Data and Constraints [CM11]

¨ Repair alternatives when records ti and tj violate FD: X → Y

¨ Value modification of RHS attributes Y

¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X], supported by the data

¨ Repair constraints by augmenting LHS (X) with a new attribute
– New attribute provides additional context

¨ Choose from alternatives using MDL-based cost model

Page 106: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

MDL-Based Cost Model [CM11]

¨ Quantifies the trade-off between a data repair and a constraint repair

¨ Cost model based on three properties
– Accuracy: value modifications must minimize distance
– Redundancy: value modifications must be well supported in the data, and constraint repairs must result in a higher degree of consistency
– Conciseness: repaired constraints should explain, but not overfit, the data

¨ Minimum description length (MDL) principle
– Length of model + length to encode data given the model

Page 107: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairing Data and Constraints [CM11]

¨ Cheap repair of constraints and data
– FD: [District, Region, Municipal] → [AC, City, State]
– t3.State = NY

Tid District Region Municipal AC Tel Street Zip City State

t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY

t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY

t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA

t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA

t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA

t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL

t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL

t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL

t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
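To see why augmenting the LHS is cheap here, the sketch below counts, for a given FD, how many tuples would have to change so that every LHS group agrees on its most common RHS value. This count is only a crude stand-in for repair cost and is not the MDL-based cost model of [CM11]; `tuples_to_change` is a hypothetical helper.

```python
from collections import Counter, defaultdict

COLUMNS = ["Tid", "District", "Region", "Municipal", "AC",
           "Tel", "Street", "Zip", "City", "State"]
ROWS = [dict(zip(COLUMNS, line.split())) for line in """
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
""".strip().splitlines()]


def tuples_to_change(rows, lhs, rhs):
    """Tuples whose RHS differs from the most common RHS of their LHS group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(tuple(row[a] for a in rhs))
    return sum(len(g) - Counter(g).most_common(1)[0][1]
               for g in groups.values())


RHS = ["AC", "City", "State"]
original = tuples_to_change(ROWS, ["District", "Region"], RHS)
augmented = tuples_to_change(ROWS, ["District", "Region", "Municipal"], RHS)
```

With the original FD all nine tuples fall into one group and five of them would need value modifications, while with Municipal added to the LHS only one tuple (t3) remains inconsistent, which is the single change t3.State = NY shown on the slide.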

Page 108: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

EGD Based Cleaning Framework [GMP+13]

¨ Many possible repairing strategies to obtain preferred values
– Using “master” data, e.g., table Src
– Using confidence and distance
– Using freshness and currency

¨ Issue: interaction between dependencies
– Sensitivity to the order in which repairs are applied

Page 109: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Validating XML

¨ Validate well-formedness first: strong validation
¨ Validate assuming well-formedness: validation

Page 110: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Validating XML

¨ How to validate well-formedness in small space?
¨ What class of DTDs can be validated in small memory when the XML document streams in?

Page 111: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Validating Well-formedness in streaming setting

¨ Streaming XML document
¨ Can we check if the document is well-formed in small space?

Page 112: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Well-formedness of XML Documents

¨ Open and close tags of XML documents must be well-formed
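When memory is not a concern, tag well-formedness can be checked with a plain stack. The sketch below is that baseline, not the small-space streaming algorithm discussed in this part of the tutorial: its memory grows with the nesting depth. The tag regex is a simplification that ignores comments, CDATA sections, and processing instructions, and `is_well_formed` is a hypothetical name.

```python
import re

# Matches <name ...>, </name>, and <name .../>; a simplification of real XML.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>")


def is_well_formed(text):
    """Return True if every open tag is closed in properly nested order."""
    stack = []
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue  # <name/> opens and closes at once
        if closing:
            if not stack or stack.pop() != name:
                return False  # close tag does not match the innermost open tag
        else:
            stack.append(name)
    return not stack  # leftover open tags mean the document is malformed


malformed = ("<article> <title> A Relational Model for Large Shared Data Banks "
             "<authors> </title> <author> <name>E. F. Codd </name></author> "
             "</article>")
well_formed = malformed.replace("<authors> ", "")
```

The `malformed` string is the snippet from this slide, where the stray `<authors>` tag is never closed; removing it gives a properly nested document.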


<article> <title>

A Relational Model for Large Shared Data Banks <authors> </title> <author>

<name>E. F. Codd

</name></author> </article>

<article> <title>

A Relational Model for Large Shared Data Banks <authors> </title> <author>

<name>E. F. Codd

</name></author> </article>

Page 113: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Validating Well-formedness in streaming setting [MMN10]

¨ Streaming XML document
¨ Can we check if it is well-formed in small space?
¨ Grammar of well-formed parentheses of s types: Dyck(s)
¨ If we can validate for Dyck(2), we can also validate for Dyck(s) with a blow-up in space

Page 114: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Validating Well-formedness in streaming setting [MMN10]

¨ Validating for

– Example: – – Matching pair: ,


Page 116: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Validating Well-formedness in streaming setting [MMN10]

¨ Define two hash functions g and h for any subword v, where p is a prime between n^{1+c} and 2n^{1+c}, and α, β are drawn uniformly at random from [0, p−1]

¨ If v is well-formed, then g(v) = h(v) = 0; otherwise the probability that both are 0 is very low

Page 117: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Validating Well-formedness in streaming setting [MMN10]

Algorithm (key idea)

¨ Read parentheses and reduce them to the form wW, where w consists of only down steps and W consists of only up steps
¨ If w is empty
– Construct hashes for W and compute its length: push (g(W), h(W), |W|) onto the stack
¨ Else
– Construct hashes for w, pop (g, h, l) from the stack, update g = g + g(w), h = h + h(w), l = l − 1, and push back onto the stack
– If l = 0 and g and h are not both 0: ERROR
– Construct hashes for W along with its length and push them onto the stack

Page 118: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairing Malformedness Efficiently [KSSY13]

¨ Repairing based on edit distance

Page 119: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Repairing Malformedness [KSSY13]

¨ In the streaming setting only very restricted errors can be repaired
¨ When there is sufficient memory to hold the entire XML document, near-linear-time algorithms can be devised with guaranteed performance
¨ Extension to consider position of text
¨ Extension to return multiple edits using branch and bound

Page 120: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Open Problems

¨ Many learning problems are based on lattice structure
– Exploit this structure better
– Example: CFD pattern tableaux learning uses partial greedy set cover. Can we design a more careful algorithm that improves on its approximation bound?

Page 121: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Open Problems

¨ Streaming and distributed settings, for both learning and detection, are extremely important
– Very basic results so far
– Data placement and replication become very useful for distributed processing

Page 122: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Open Problems

¨ Semistructured Data
– What is the most general model that is tractable (validation + repair) in different computation models for XML?
– Learning distributions of types of errors
– Language Edit Distance Problem

Page 123: Data Quality: the “other” Face of Big Data Barna Saha, Divesh Srivastava AT&T Labs-Research

Open Problems

¨ Crowdsourcing
– Use the crowd to distinguish between data and error
– Extend crowd-based entity resolution techniques to handle matching dependencies
– Model errors made by the crowd itself


?