Data Quality: the “other” Face of Big Data
Barna Saha, Divesh Srivastava (AT&T Labs-Research)
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
2
Big Data + Data Quality
¨ Big data: all about the V’s
– Size: huge volume of data from multiple sources
– Speed: dynamic data, collected and analyzed at high velocity
– Complexity: huge variety of data and sources
¨ Goal: to extract significant value from big data
¨ Key issue: data quality
– Raw data is often of questionable veracity
– How do we obtain high quality information?
3
Case Study: Big Data Quality [LDL+13]
¨ Study on two domains
– Belief of clean data
– Poor quality data can have big impact
4
        #Sources  Period   #Objects  #Local-attrs  #Global-attrs  Considered items
Stock   55        7/2011   1000*20   333           153            16000*20
Flight  38        12/2011  1200*31   43            15             7200*31
Case Study: Big Data Quality
¨ Is the data consistent?
– Tolerance to 1% value difference
5
Case Study: Big Data Quality
¨ Why such inconsistency?
– Semantic ambiguity
6
Yahoo! Finance: 52wk Range: 25.38-95.71; Day’s Range: 93.80-95.71
Nasdaq: 52 Wk: 25.38-93.72
Case Study: Big Data Quality
¨ Why such inconsistency?
– Unit errors
7
76,821,000
76.82B
Case Study: Big Data Quality
8
Case Study: Big Data Quality
¨ Why such inconsistency?
– Pure errors
9
FlightView: 6:15 PM, 9:40 PM
FlightAware: 6:15 PM, 8:33 PM
Orbitz: 6:22 PM, 9:54 PM
Case Study: Big Data Quality
¨ Why such inconsistency?
– Random sample of 20 data items + 5 items with the largest # of values
10
Case Study: Big Data Quality
11
¨ Copying between sources?
Case Study: Big Data Quality
¨ Copying on erroneous data?
12
Case Study: Lessons Learned
¨ Big data has considerable inconsistency
– Even in domains where poor quality data can have big impact
– Semantic ambiguity, out-of-date data, unexplainable errors
¨ Data sources often copy from each other
– Copying can happen on erroneous data, spreading poor quality data
13
Data Quality: By the Numbers
¨ Impact of poor data quality
– Erroneous data costs US businesses $600 billion/year [E02]
– In DW projects, data cleaning takes 30-80% of time and budget
– Data quality tools market is growing at 16% annually, way over the 7% average for other IT segments [G07]
¨ How much data is erroneous?
– Enterprise data error rates: average of 1-5%, some > 30% [R98]
– Only 1/3rd of XML Web documents with XSD/DTD are valid; 14% even lack well-formedness [GM11]
14
Small Data Quality: How Was It Achieved?
¨ Specify all domain knowledge as integrity constraints on data
– Reject updates that do not preserve integrity constraints
– Works well when the domain is well understood and static
15
Big Data Quality: A Different Approach?
¨ Big data: integrity constraints cannot be specified a priori
– Data diversity → complete domain knowledge is infeasible
– Data evolution → domain knowledge quickly becomes obsolete
– Too much rejected data → “small” data
16
Big Data Quality: A Different Approach?
¨ Big data: integrity constraints cannot be specified a priori
– Data diversity → complete domain knowledge is infeasible
– Data evolution → domain knowledge quickly becomes obsolete
¨ Solution: let the data speak for itself
– Learn models (semantics) from the data
– Identify data glitches as violations of the learned models
– Repair data glitches and models in a timely manner
17
In This Tutorial
¨ A focus on well-structured data and logic-based data quality
– Models: logical constraints, e.g., (C)FDs, IDs, MDs, EGDs, DCs
– Repairs: cost-based modifications to the data and models
¨ What we do not discuss in this tutorial
– Logic-based: consistent query answering, without data repairs
– Statistics-based: statistical models, anomaly detection
– Unstructured data: quality of audio, video
18
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
19
A Systematic Way to Data Quality
¨ Impose integrity constraints
¨ Errors and inconsistencies in the data emerge as violations of the constraints
Discovering/ Learning Data Quality Semantics
¨ “Small data”: manually specify rules that govern the data semantics
¨ “Big data”: let the data speak for itself
– Learn rules and patterns from the data
Discovering/ Learning Data Quality Semantics
¨ Variety of data
– Looking at condition and context
– Statistically robust measures
¨ Volume of data
– Scalable algorithms: efficiency vs. accuracy
¨ Velocity of data
– Streaming and incremental algorithms
An Instance of the Sales Relation
FD: [name, type, country] → [price, tax]
The functional dependency does not hold on this instance
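The check illustrated above can be sketched directly: group tuples on the LHS attributes and flag groups that disagree on the RHS. The relation and values below are hypothetical stand-ins for the slide's table, not its actual data.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Group tuples on LHS; a group with >1 distinct RHS violates the FD."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].add(tuple(row[a] for a in rhs))
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical Sales tuples for [name, type, country] -> [price, tax]
sales = [
    {"name": "iPad", "type": "tablet", "country": "UK", "price": 499, "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "UK", "price": 479, "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "US", "price": 499, "tax": 0.08},
]
viols = fd_violations(sales, ["name", "type", "country"], ["price", "tax"])
# The two UK tuples disagree on price, so the FD does not hold here.
```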
Conditional Functional Dependency
Full vs. Conditional
¨ A functional dependency specifies integrity constraints over the whole database
¨ High variety of data: one size does NOT fit all
– Conditional functional dependency
– Similarly: conditional inclusion dependency, conditional sequential dependency, conditional conservation dependency
An Instance of the Sales Relation
FD: [name, type, country] → [price, tax]
Consider the pattern [ -, -, UK || -, - ]
A pattern must have enough support, but it is OK to have small violations: these are possibly data errors
Local support = 7/20 = 0.35; local confidence = 6/7 ≈ 0.857
Global support = 15/20 = 0.75; global confidence = 13/15 ≈ 0.87
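A minimal sketch of these measures on a toy relation (not the 20-tuple slide example): support is the fraction of tuples matching the pattern, and confidence is the largest fraction of matching tuples on which the embedded FD can be made to hold.

```python
def cfd_stats(rows, lhs, rhs, pattern):
    """Local support and confidence of a CFD pattern.

    pattern maps each LHS attribute to a required constant, '-' is a
    wildcard. Confidence keeps, per LHS group, the most common RHS value.
    """
    from collections import Counter, defaultdict
    match = [r for r in rows if all(pattern[a] in ('-', r[a]) for a in lhs)]
    groups = defaultdict(Counter)
    for r in match:
        groups[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    support = len(match) / len(rows)
    confidence = kept / len(match) if match else 1.0
    return support, confidence

# Hypothetical data: three UK tuples, one with a deviating tax value.
sales = [
    {"name": "iPad", "type": "tablet", "country": "UK", "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "UK", "tax": 0.20},
    {"name": "iPad", "type": "tablet", "country": "UK", "tax": 0.05},
    {"name": "iPad", "type": "tablet", "country": "US", "tax": 0.08},
]
sup, conf = cfd_stats(sales, ["name", "type", "country"], ["tax"],
                      {"name": "-", "type": "-", "country": "UK"})
```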
Exact vs. Soft/Approximate
¨ Exact approaches might lead to overfitting and a large number of patterns
– Open world assumption
¨ Notions of support and confidence give statistically robust measures
Learning Conditional Functional Dependencies (CFDs)
¨ Given an embedded FD, learn the pattern tableau
¨ Learn CFDs from scratch
– Learn the FD and also the patterns
¨ Learned CFDs should have enough support and confidence
Learning Pattern Tableaux [GKK+08]
¨ Generate the smallest tableau with given global support and global confidence
– NP-complete
– Hard to approximate
¨ Generate the smallest tableau with given global support and local confidence
– NP-complete
– APX-hard
– Approximable in tableau size (via greedy set cover)
Efficiency vs. Accuracy [GKK+08]
¨ Trade off running time against accuracy of the solution
– Learning pattern tableaux given an embedded FD X → Y
1. Consider all instantiations of X
2. Prune based on local confidence
3. Apply PARTIAL GREEDY COVERAGE until the desired support is reached
An Instance of the Sales Relation
FD: [name, type, country] → [price, tax]; pattern [ -, -, UK || -, - ]
(Figure: in the set-cover view, each candidate pattern is a SET and the tuples it covers are the ELEMENTS.)
Efficiency vs. Accuracy [GKK+08]
¨ All instantiations of X
– X = (A, B, C) with A = {a}, B = {b}, C = {c}
– All instantiations of X: {-, -, -}, {a, -, -}, {-, b, -}, {-, -, c}, {a, b, -}, {a, -, c}, {-, b, c}, {a, b, c}
– If |X| = K, then the number of patterns is 2^K
– Too many sets to consider in each iteration
Efficiency vs. Accuracy [GKK+08]
¨ Incremental generation of the search space
{-,-,-}
{a,-,-} {-,b,-} {-,-,c}
{a,b,-} {a,-,c} {-,b,c}
{a,b,c}
– Do not instantiate the entire search space of X
– Start from the top; if local confidence is not met, explore its children that are not already pruned
– If local confidence is met, remove the entire sub-lattice incident on it
¨ Same search space exploration as PARTIAL GREEDY SET COVER
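The lattice walk above can be sketched as follows (a toy setting with one constant per attribute, as in the X = (A, B, C) example; the subsequent partial greedy coverage step is not shown). `conf_of` stands in for whatever local-confidence computation is used.

```python
def explore_lattice(consts, conf_of, min_conf):
    """Top-down walk of the pattern lattice from the all-wildcard pattern.

    consts[i]: the constant available at position i. conf_of(p): local
    confidence of pattern p. A qualifying pattern is kept and its entire
    sub-lattice is pruned, mirroring the slide's rule.
    """
    def refines(p, q):  # p lies in the sub-lattice below q
        return all(qc == '-' or qc == pc for pc, qc in zip(p, q))

    keep, seen = [], set()
    stack = [tuple('-' for _ in consts)]
    while stack:
        p = stack.pop()
        if p in seen or any(refines(p, q) for q in keep):
            continue  # already visited, or inside a pruned sub-lattice
        seen.add(p)
        if conf_of(p) >= min_conf:
            keep.append(p)  # prune everything below p
        else:
            for i, v in enumerate(p):
                if v == '-':  # refine one wildcard to its constant
                    stack.append(p[:i] + (consts[i],) + p[i + 1:])
    return keep
```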
Streaming Validation of CFDs [CGK+09]
¨ Massive amounts of data arrive online
¨ Learn CFDs from sampled data, validate against the voluminous data
– Data does not fit in memory
– Create a concise summary of the data (fast)
Streaming Validation of CFDs [CGK+09]
¨ Simple summaries do not work
– Uniform sampling
– Uniform group sampling
(Figure: on the example tableau, estimated confidences of 0.75 and 0.625 diverge from the true confidence of 1.)
Streaming Validation of CFD [CGK+09]
¨ Given a relation R and an embedded FD X → Y, create a synopsis of the data so that, given any arbitrary CFD, we can return an estimate of its confidence with bounded error
¨ Approximation for efficiency
Streaming Validation of CFDs [CGK+09]
¨ Two-pass algorithm
– Sample (reservoir sampling) O(·) rows uniformly
– For each sampled row that satisfies the CFD on X:
 Sample (reservoir sampling) O(·) rows from its support and estimate confidence
 Alternate: maintain heavy hitters with space O(·)
– Return the average confidence
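The sampling primitive used in both passes is standard reservoir sampling, sketched below; the space bounds on the slide (the O(·) terms) depend on the error parameters and are not shown here.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """One-pass uniform sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)  # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```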
Streaming Validation of CFDs [CGK+09]
¨ Converting to a single pass
– Main idea:
 Classify groups based on exponentially decreasing support
 Keep a summary for the groups sampled at each level
– Estimate the support of each group
– Estimate the confidence of each group
– Combine these into the overall estimate
Learning CFDs from Scratch [FGLX+09]
¨ Classification of CFDs
– Constant CFDs: patterns contain only constants
– Variable CFDs: patterns may contain the wildcard “-”
– Learning constant CFDs is more efficient than variable CFDs
– Variable CFDs give more concise patterns
Learning CFDs from Scratch [FGLX+09]
¨ What kind of CFDs do we want to learn?
– Minimal CFDs: constant minimal or variable minimal
– Frequent CFDs: must have support over a threshold
Learning CFDs from Scratch [FGLX+09]
¨ Useful definitions
– Free itemset: no proper subset has the same support
– Closed itemset: no proper superset has the same support
1. If a constant CFD is minimal, then its LHS pattern is free and has the same support, so it is contained in a closed set
2. Also, there should not exist any smaller free set with the same property
Learning CFDs from Scratch [FGLX+09]
¨ CFDMiner (constant CFDs)
• Suppose we have all k-frequent closed itemsets and their corresponding k-frequent free itemsets at our disposal (GCGROWTH)
• Closed sets yield the only possible consequents [Property 1]; free sets yield minimal antecedents [Property 2]
• Return the corresponding CFD for each such pair
¨ Variable CFDs
• CTANE: extension of TANE for FDs; a level-wise algorithm that explores the attribute-set/pattern lattice
• FASTCFD: extension of FastFD for FDs; a depth-first search approach
Some Other Dependencies
¨ Inclusion
¨ Matching
¨ Sequential
¨ Conservation
¨ Denial
58
Inclusion Dependency
¨ Example: every manager is an employee
¨ Extension by condition and approximation
– Example: most persons in English DBpedia born in the 19th century and dying in the USA are also in German DBpedia
¨ Learning CINDs given an IND
59
Matching Dependency
• Generalization of entity resolution
• If two tuples show similar values in certain attributes, then a given attribute value of these tuples must be matched (made the same)
– Example: if names and phone numbers are sufficiently similar, make the addresses identical
Sequential Dependency
• Useful to express relationships between ordered attributes
• X → Y with gap range g: the difference between the Y-attribute values of any two consecutive records, when sorted on X, must lie in g
• Can identify missing data (gaps too large), extraneous data (gaps too small), and out-of-order data
• Extensions: approximate, conditional
• Creating pattern tableaux efficiently
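The basic check can be sketched as below: sort on X and flag consecutive Y-gaps outside the allowed range (the record schema here is hypothetical).

```python
def sd_violations(records, x, y, lo, hi):
    """Gaps of attribute y between consecutive records sorted on x
    that fall outside [lo, hi]. Gaps above hi suggest missing data;
    gaps below lo suggest extraneous or out-of-order data."""
    ordered = sorted(records, key=lambda r: r[x])
    return [(p[x], c[x], c[y] - p[y])
            for p, c in zip(ordered, ordered[1:])
            if not lo <= c[y] - p[y] <= hi]
```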
Conservation Dependency
• Useful to express relationships between two or more time series
– Example: total inflow over time must match total outflow over time
• Extensions: approximate, conditional
• Creating pattern tableaux efficiently
Learning Pattern Tableaux Efficiently [GKK+12]
• Conservation dependency: a quick flavor
• Extension with condition and approximation
– Example: total inflow over time must match total outflow over time
Conservation Dependency: Defining the Measure [GKK+12]
¨ Confidence of an interval: a simple violation count ignores the duration of violation
(Figure: incoming vs. outgoing traffic at a router; both series read 10, 8, 6, 4, 6 over sub-intervals a1..a5 and b1..b5.)
¨ Rightward matching between IN and OUT: travel minimally to the right to get matched
¨ A special case of EARTH MOVER DISTANCE
Conservation Dependency: Defining the Measure [GKK+12]
¨ Confidence = 1 − EMD / (maximum possible EMD)
– Matched IN/OUT series: EMD = 0, maximum possible EMD = 114, confidence = 1
– Mismatched series (one value displaced to 5): EMD = 114, maximum possible EMD = 114, confidence = 0
¨ How do we find all maximal intervals with high confidence efficiently?
Conservation Dependency [GKK+12]
• Key idea:
– Look at the cumulative curves (comes from EMD)
– Consider only a subset of intervals (for efficiency)
– Generate these subsets going backward from the n-th data point (to ensure a guaranteed approximation factor in near-linear time)
Efficiency VS Accuracy
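The cumulative-curve view can be sketched directly: for two series with equal totals, the one-dimensional EMD equals the L1 distance between their cumulative curves. How the slides compute the maximum possible EMD (114 in the example) is not recoverable here, so it is taken as a parameter.

```python
def emd_1d(inflow, outflow):
    """EMD between two equal-total series: L1 gap of the cumulative curves."""
    assert sum(inflow) == sum(outflow), "totals must be conserved"
    emd = cum = 0
    for a, b in zip(inflow, outflow):
        cum += a - b          # running surplus that must travel rightward
        emd += abs(cum)
    return emd

def conservation_confidence(inflow, outflow, max_emd):
    """Confidence = 1 - EMD / (maximum possible EMD), as on the slides."""
    return 1 - emd_1d(inflow, outflow) / max_emd
```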
Denial Constraints
• Universally quantified first-order logic
• Much more expressive than FDs and CFDs
• Examples:
a) If two persons live in the same state, then the one earning a lower salary has a lower tax rate
b) It is not possible to have a single tax exemption greater than the salary
• Useful for data repairing; discovery of denial constraints (with two attributes)
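Checking a denial constraint amounts to searching for a forbidden pair of tuples; a minimal sketch for example (a), on a hypothetical schema:

```python
from itertools import permutations

def dc_violations(rows, forbidden):
    """Ordered tuple pairs (i, j) for which the forbidden predicate holds."""
    return [(i, j) for (i, t1), (j, t2) in permutations(enumerate(rows), 2)
            if forbidden(t1, t2)]

# Example (a): same state, lower salary, but higher tax rate
def salary_tax(t1, t2):
    return (t1["state"] == t2["state"]
            and t1["salary"] < t2["salary"]
            and t1["tax"] > t2["tax"])
```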
Semi-structured Data
• Flexible representation
• Easy customization
• Error-prone
• The vast majority of XML documents on the Web do not have an accompanying DTD or XSD schema description
Learning DTD/XSD from an XML Corpus
• A good inference algorithm should satisfy:
1. Specialization: must minimally cover the given XML documents
2. Generalization: cover documents that are valid according to the “unknown” target schema but not present in the sample
Learning Document Type Definitions (DTDs)
• DTD: a context-free grammar with a regular expression (RE) on the RHS
• For every element name, infer the RE describing all the strings that appear below that element name in the XML corpus
• A seminal result by Gold: the class of all REs cannot be learned from positive examples only
• Which subsets of REs can be learned efficiently? [BNST+06]
• Class of SINGLE OCCURRENCE REs (SOREs)
– Every element name can appear only once, e.g., (a + b)? c is a SORE, but a (b a)* is not (a occurs twice)
• Class of CHAIN REGULAR EXPRESSIONS (CHAREs)
– Subset of SOREs: a chain of factors
– Experimentally performs better for generalization
Learning SOREs [BNST+06]
¨ SOREs are 2-testable
– A language is 2-testable when there are a set of start element names, a set of final element names, and a set of 2-grams such that a string belongs to the language iff its first symbol is a start element name, its last symbol is a final element name, and every 2-gram of the string is in the set
¨ Example
(Figure: an example automaton over the element names a, b, c.)
Learning SOREs [BNST+06]
¨ Given a set of strings, extract all the initial symbols, all the final symbols, and all the 2-grams; create the automaton
¨ Convert the automaton to an RE by rewriting
Rewrite rule DISJUNCTION: for a set of nodes that all have the same predecessor and successor sets:
(i) if they have no edges among themselves, merge them into a single node (e.g., a and b become a + b)
(ii) if they have all the edges among themselves, merge them into a single node and add a self-loop
(Figure: nodes a and b merge into (a + b), followed by c.)
Learning SOREs [BNST+06]
Rewrite rule SELF-LOOP: for a node r with a self-loop, delete the loop and relabel r as r+
(Figure: (a + b) with a self-loop becomes (a + b)+, followed by c.)
Learning SOREs [BNST+06]
Rewrite rule CONCATENATION: concatenate a chain of nodes into a single node
(Figure: (a + b)+ followed by c concatenates into (a + b)+ c.)
Learning SOREs [BNST+06]
Rewrite rule OPTIONAL: if all successors of r are also successors of r’s predecessors, relabel r as r? and remove all edges from r’s predecessors to r’s successors
¨ If the underlying DTD is indeed a SORE, the algorithm learns it
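The extraction step (before the rewriting to an RE, which is not shown) can be sketched as follows: learn the start symbols, final symbols, and 2-grams from positive examples, then test membership under 2-testability.

```python
def learn_2testable(samples):
    """Start symbols, final symbols, and 2-grams of the sample strings."""
    starts = {s[0] for s in samples}
    finals = {s[-1] for s in samples}
    grams = {(x, y) for s in samples for x, y in zip(s, s[1:])}
    return starts, finals, grams

def accepts(lang, word):
    """Membership in the learned 2-testable language: first symbol is a
    start, last symbol is a final, every 2-gram is allowed."""
    starts, finals, grams = lang
    return (bool(word) and word[0] in starts and word[-1] in finals
            and all(g in grams for g in zip(word, word[1:])))
```

Note how the learned language generalizes: training on "abc" and "ac" accepts exactly the strings built from the observed starts, finals, and 2-grams.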
Learning XSDs from an XML Corpus
¨ The content model of an element depends on its context
– Items in an order contain id and price
– Items in a stock contain id, quantity in stock and, depending on whether the item is atomic or composed, a list of sub-items
– A DTD does not distinguish between order items and stock items
¨ A single-occurrence XSD contains only single-occurrence regular expressions
Outline
¨ Introduction
¨ Discovering data quality semantics
¨ Repairing inconsistencies
¨ Open problems + Q/A
83
Repair Techniques
¨ Glitch repairs by value modification, for FDs + InDs [BFF+05]
– Introduced the idea of cell equivalence classes
¨ Glitch + model repairs, for FDs [CM11]
– Introduced the idea of model repairs
¨ Glitch repairs, for EGDs [GMP+13]
– Introduced a chase-based technique to repair many constraints
84
Repairs Using Value Modification [BFF+05]
¨ Problem: given a database D and FD and InD constraints C such that (D, C) is inconsistent, find a repair D’ of D with minimum cost(D’)
¨ Result: the problem is NP-hard even for only FDs or only InDs
¨ Key ideas:
– Focus on value modifications of FD RHS attributes
– Cost model for repairs is based on value accuracy and repair similarity
– Equivalence classes of cells with identical values in the repair permit delayed assignment of a value to an equivalence class
86
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
87
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
88
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
89
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
90
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
Repairs Using Value Modification [BFF+05]
¨ Repair alternatives when records ti and tj violate FD: X → Y
¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X]
– Unclear what (different) value should be assigned to tj[X]
¨ Value modification of RHS attributes Y
– Modify tj[Y] to equal ti[Y], or vice versa
– Use cost of repair to choose between alternatives
– FD violations can always be repaired by modifying RHS attributes Y
– A naïve approach can lead to non-termination
91
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
92
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
93
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
94
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
Repairs Using Value Modification [BFF+05]
¨ FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
95
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
?
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
 FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
96
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
X
Repairs Using Value Modification [BFF+05]
¨ Repair alternatives when record ti violates InD: Ri[X] → Rj[Y]
¨ Value modification of ti[X]
– Modify ti[X] to the value tj[Y] of some tj in Rj
¨ Value modification of tj[Y]
– Modify tj[Y] for some tj in Rj to equal ti[X]
¨ Use cost of repair to choose between alternatives
97
Repairs Using Value Modification [BFF+05]
¨ InD: Equip[Tel] → Customer[Tel]
 FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
 FD: Customer[Zip] → Customer[City, State]
 FD: Customer[Name, Street, Zip] → Customer[Tel]
98
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8145 Bob Jones 5 Valley Centre NY 10012 1
t4 949-1212 Alice Smith 17 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8145 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ Greedily build equivalence classes of cells
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}
– {(t1, Name), (t4, Name)}
– …
99
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
Repairs Using Value Modification [BFF+05]
¨ Greedily build equivalence classes of cells, assign a unique value to each
– {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)} → 555-8145
– {(t1, Name), (t4, Name)} → Alice Smith
– …
100
CUSTOMER
TId Tel Name Street City State Zip Wt
t1 949-1212 Alice Smith 17 Bridge Midville AZ 05211 2
t2 555-8145 Bob Jones 5 Valley Centre NY 10012 2
t3 555-8195 Bob Jones 5 Valley Centre NJ 10012 1
t4 949-1212 Ali Smith 27 Bridge Midville AZ 05211 1
EQUIP
Tid Tel SerNo EqMfct EqModel InstDate Wt
t5 555-8145 L55001 LU ze400 Jan-03 2
t6 555-8195 L55011 LU ze400 Mar-03 1
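The equivalence classes of cells above can be maintained with a union-find structure; a minimal sketch (the (tuple, attribute) cell naming follows the slides, the value-assignment step is omitted):

```python
class CellClasses:
    """Union-find over (tuple_id, attribute) cells, as in [BFF+05]."""
    def __init__(self):
        self.parent = {}

    def find(self, cell):
        self.parent.setdefault(cell, cell)
        while self.parent[cell] != cell:
            self.parent[cell] = self.parent[self.parent[cell]]  # path halving
            cell = self.parent[cell]
        return cell

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Merging cells that an FD/InD repair forces to be equal:
cc = CellClasses()
cc.merge(("t2", "Tel"), ("t3", "Tel"))
cc.merge(("t5", "Tel"), ("t2", "Tel"))
```

A single value (e.g., 555-8145) can then be assigned to each class at the end, which is what makes the delayed assignment cheap.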
Repair Techniques
¨ Glitch repairs by value modification, for FDs + InDs [BFF+05]
– Introduced the idea of cell equivalence classes
¨ Glitch + model repairs, for FDs [CM11]
– Introduced the idea of model repairs
¨ Glitch repairs, for EGDs [GMP+13]
– Introduced a chase-based technique to repair many constraints
101
Repairing Data and Constraints [CM11]
¨ Motivation: evolution of data semantics
¨ Problem: given a database D and FD constraints C such that (D, C) is inconsistent, find a repair (D’, C’) with minimum cost
¨ Key ideas:
– Allow value modifications of FD RHS or LHS attributes
– Allow modifications of FDs in C by augmenting the LHS
– Cost model for repairs is based on minimum description length
102
Repairing Data and Constraints [CM11]
¨ FD: [District, Region] → [AC, City, State]
103
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
Repairing Data and Constraints [CM11]
¨ FD: [District, Region] → [AC, City, State]
– Expensive repair using only value modifications
104
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
Repairing Data and Constraints [CM11]
¨ Repair alternatives when records ti and tj violate FD: X → Y
¨ Value modification of RHS attributes Y
¨ Value modification of LHS attributes X
– Modify tj[X] to a value different from ti[X], supported by the data
¨ Repair constraints by augmenting the LHS (X) with a new attribute
– The new attribute provides additional context
¨ Choose from the alternatives using an MDL-based cost model
105
MDL-Based Cost Model [CM11]
¨ Quantifies the trade-off of a data repair versus a constraint repair
¨ Cost model based on three properties
– Accuracy: value modifications must minimize distance
– Redundancy: value modifications must be well supported in the data; constraint repairs must result in a higher degree of consistency
– Conciseness: repaired constraints should explain, but not overfit
¨ Minimum description length (MDL) principle
– Length of the model + length to encode the data given the model
106
Repairing Data and Constraints [CM11]
¨ Cheap repair of constraints and data
– FD: [District, Region, Municipal] → [AC, City, State]
– t3.State = NY
107
Tid District Region Municipal AC Tel Street Zip City State
t1 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t2 Brookside Granville Glendale 613 974-2345 Boxwood 10211 NY NY
t3 Brookside Granville Glendale 613 299-1010 Westlane 10211 NY MA
t4 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t5 Brookside Granville Guild 515 220-1200 Squire 02215 Boston MA
t6 Brookside Granville Queen 517 930-2525 Main 60415 Chicago IL
t7 Brookside Granville Queen 517 888-5152 Main 60415 Chicago IL
t8 Brookside Granville Queen 517 888-5152 Main 60601 Chicago IL
t9 Brookside Granville Queen 517 888-5152 Bay 60601 Chicago IL
EGD-Based Cleaning Framework [GMP+13]
¨ Many possible repairing strategies to obtain preferred values
– Using “master” data, e.g., table Src
– Using confidence and distance
– Using freshness and currency
¨ Issue: interaction between dependencies
– Sensitivity to the order in which repairs are applied
108
Validating XML
¨ Validate well-formedness first: strong validation
¨ Validate assuming well-formedness: validation
109
Validating XML
¨ How do we validate well-formedness in small space?
¨ What class of DTDs can be validated in small memory when the XML document streams in?
110
Validating Well-formedness in the Streaming Setting
¨ Streaming XML document
¨ Can we check if the document is well-formed in small space?
Well-formedness of XML Documents
¨ Open and close tags of XML documents must be well-formed
112
<article> <title>
A Relational Model for Large Shared Data Banks <authors> </title> <author>
<name>E. F. Codd
</name></author> </article>
Validating Well-formedness in the Streaming Setting [MMN10]
¨ Streaming XML document: can we check if it is well-formed in small space?
¨ Grammar of well-formed parentheses of s types
¨ If we can validate for 2 types, we can also validate for s types with a small blow-up in space
Validating Well-formedness in the Streaming Setting [MMN10]
¨ Validating for 2 types of parentheses
– Example with matching pairs of open and close parentheses
Validating Well-formedness in streaming setting [MMN10]
¨ Define two hash functions g and h over any subword, where p is a prime between n^(1+c) and n^(2(1+c)), and α, β are drawn uniformly from [0, p − 1]
¨ If v is well-formed then g(v) = h(v) = 0; otherwise, the probability that both are 0 is very low
Validating Well-formedness in the Streaming Setting [MMN10]
Algorithm (key idea)
¨ Read parentheses and reduce them to the form wW, where w consists of only down-steps and W consists of only up-steps
¨ If w is empty:
– construct hashes for W and compute its length: push (g(W), h(W), |W|) onto the stack
¨ Else:
– construct hashes for w, pop (g, h, l) from the stack, update g = g + g(w), h = h + h(w), l = l − 1, and push back onto the stack
– if l = 0 and g and h are not both identically 0: ERROR
– construct hashes for W along with its length and push onto the stack
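The invariant this hashing scheme certifies is ordinary balanced nesting of tags. A minimal non-streaming checker with an explicit stack (O(depth) space, rather than the sublinear space of [MMN10]), run on the slide's Codd-article example:

```python
import re

def well_formed(xml):
    """Check that open/close tags nest properly, using an explicit stack.
    Self-closing tags are skipped; text between tags is ignored."""
    stack = []
    for m in re.finditer(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>", xml):
        closing, name, selfclose = m.groups()
        if selfclose:
            continue
        if closing:
            if not stack or stack.pop() != name:
                return False        # close tag with no matching open tag
        else:
            stack.append(name)
    return not stack                # every open tag must have been closed
```

On the slide's example, `<authors>` is opened inside `<title>` but never closed before `</title>`, so the document is rejected.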
Repairing Malformedness Efficiently [KSSY13]
¨ Repairing based on edit distance
Repairing Malformedness [KSSY13]
¨ In the streaming setting, only very restricted errors can be repaired
¨ When there is sufficient memory to hold the entire XML document, near-linear-time algorithms can be devised with guaranteed performance
¨ Extension to consider the position of text
¨ Extension to return multiple edits using branch and bound
Open Problems
¨ Many learning problems are based on a lattice structure
– Exploit this structure better
– Example: CFD pattern tableau learning uses partial greedy set cover. Can we design a more careful algorithm that beats its approximation bound?
Open Problems
¨ Streaming and distributed settings, both for learning and detection, are extremely important
– Very basic results so far
– Data placement and replication become very useful for distributed processing
Open Problems
¨ Semistructured data
– What is the most general model that is tractable (validation + repair) in different computation models for XML?
– Learning distributions of types of errors: the Language Edit Distance problem
Open Problems
¨ Crowdsourcing
– Use the crowd to distinguish between data and error
– Extend crowd-based entity resolution techniques to handle matching dependencies
– Model errors made by the crowd themselves
?