Approximate Schemas and Data Exchange
Michel de Rougemont, University Paris II & LRI
Joint work with Adrien Vielleribière, University Paris-South




Plan

1. Classical Data Exchange on words and trees
2. Approximation based on Property Testing
3. Tester for regular words and regular trees with the Edit Distance with Moves
4. Approximate Data Exchange
5. Composition of Data Exchange settings


1. Data Exchange on Words and Trees

Sources:

Word 000001111, schema 0*1*

<!ELEMENT db (work*)>
<!ELEMENT work (author*)>
<!ATTLIST work title CDATA #REQUIRED year CDATA #IMPLIED>
<!ELEMENT author EMPTY>
<!ATTLIST author name CDATA #REQUIRED>

Targets:

Word ?, schema c(ab)*ca*

<!ELEMENT bib (livre*)>
<!ELEMENT livre (auteur+, titre, annee)>
<!ELEMENT auteur (#PCDATA)>
<!ELEMENT titre (#PCDATA)>
<!ELEMENT annee (#PCDATA)>


Transducers transform the data

Words: the transducer 0:ab, 1:a maps 0001111 (in 0*1*) to abababaaaa. The target schema is c(ab)*ca*, which contains for example cabababcaaaa.

Trees: the transducer is an XSLT program:

…..
<xsl:template match="work">
  <livre>
    <xsl:apply-templates/>
    <titre><xsl:value-of select="@title"/></titre>
  </livre>
</xsl:template>
…..
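The word example above can be replayed with a few lines of Python. This is only a minimal sketch: the helper names `transduce` and `in_schema` are illustrative assumptions, and exact schema membership is checked with a regular expression rather than with the testers discussed later.

```python
import re

# Letter-to-word rules of the one-state transducer from the slide: 0 -> ab, 1 -> a.
RULES = {"0": "ab", "1": "a"}


def transduce(word, rules=RULES):
    """Apply a one-state transducer: replace each input letter by its output word."""
    return "".join(rules[letter] for letter in word)


def in_schema(word, pattern):
    """Exact membership of a word in a regular schema given as a regular expression."""
    return re.fullmatch(pattern, word) is not None


source = "0001111"
image = transduce(source)
print(image)                                    # abababaaaa
print(in_schema(source, r"0*1*"))               # True
print(in_schema(image, r"c(ab)*ca*"))           # False
print(in_schema("cabababcaaaa", r"c(ab)*ca*"))  # True
```

The image is not exactly in the target schema (the two letters c are missing), which is the kind of situation where a notion of closeness to the target becomes useful.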


Main Problems

Data Exchange setting (K_S, τ, K_T):
• Fagin et al. 2002: τ defined by Source-Target-Dependencies on relations
• Libkin et al. 2005: τ defined by Tree-Pattern-Formulas on trees

1. Source-Consistency: given a source structure I in K_S, is there a target J in K_T s.t. (I,J) in τ?
2. Typechecking: decide if for all I in K_S, there is a target J in K_T s.t. (I,J) in τ.
3. Composition of settings?


Approximate Data Exchange

Data Exchange setting (K_S, τ, K_T), where τ is a transducer:

• ε-Source-Consistency: given a source structure I, is there a source I' in K_S, ε-close to I, s.t. τ(I') is ε-close to K_T?
• ε-Typechecking: decide if for all I in K_S, τ(I) is ε-close to K_T.
• ε-Composition of settings.


2. Property Testing

Let F be a property on a class K of structures U.

An ε-tester for F is a probabilistic algorithm A such that:
• If U |= F, A accepts
• If U is ε-far from F, A rejects with high probability

A property F is testable if there exists a probabilistic algorithm A s.t.
• For all ε, it is an ε-tester for F
• Time(A) is independent of n.

Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994.
O. Goldreich, S. Goldwasser and D. Ron, Property Testing and its connection to Learning and Approximation, 1996.

A tester usually implies a linear time corrector. (ε1, ε2)-Tolerant Tester.
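To make the definition concrete, here is a minimal sketch of an ε-tester for a deliberately simple toy property that is not taken from the slides: "the word contains only the letter 0", with the Hamming distance (fraction of positions to change) as an assumed distance. The function name and the sample size ceil(ln 3 / ε) are illustrative choices.

```python
import math
import random


def all_zero_tester(word, eps, rng):
    """One-sided eps-tester for the toy property 'word contains only 0'.

    It reads ceil(ln(3) / eps) random positions, a number independent of len(word).
    A word in the property is always accepted. A word with at least eps * n
    letters 1 is accepted with probability at most (1 - eps)^samples <= 1/3.
    """
    if not word:
        return True
    samples = math.ceil(math.log(3) / eps)
    return all(word[rng.randrange(len(word))] == "0" for _ in range(samples))


rng = random.Random(0)
print(all_zero_tester("0" * 1000, 0.1, rng))  # True: the property holds
print(all_zero_tester("1" * 1000, 0.1, rng))  # False: every sampled letter is 1
```

The number of queries depends only on ε, which is the sense in which Time(A) is independent of n.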


Approximate Satisfiability and Equivalence

1. Satisfiability: T |= F
2. Approximate Satisfiability: T |=ε F
3. Approximate Equivalence of F and G

[Figure: image on a class K of trees, showing F, the trees ε-far from F, and a second property G.]


History of Testers

Self-testers and correctors for Linear Algebra, Blum & Kannan 1989

Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994

Testers for graph properties: k-colorability, Goldreich et al. 1996

Regular languages have testers, Alon et al. 2000

Testers for regular tree languages, Magniez and de Rougemont, 2004

Characterization of testable properties on graphs, Alon et al. 2005

New areas: Sublinear algorithms, Approximation of decision problems


Edit Distances with Moves

1. Classical Edit Distance: Insertions, Deletions, Modifications

2. Edit Distance with Moves

0111000011110011001
0111011110000011001

3. Edit Distance with Moves generalizes to Ordered Trees

$\mathrm{dist}(W, W')$; $\mathrm{dist}(W, L) = \min_{W' \in L} \mathrm{dist}(W, W')$
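The two words shown for the Edit Distance with Moves differ by a single move of the block 1111. The sketch below checks this; the function `move` and its index convention are assumptions made for illustration.

```python
def move(word, start, end, dest):
    """Cut word[start:end] and reinsert it at index dest of the remaining word.

    This counts as one operation of the edit distance with moves.
    """
    block = word[start:end]
    rest = word[:start] + word[end:]
    return rest[:dest] + block + rest[dest:]


w = "0111000011110011001"
w2 = "0111011110000011001"

# Moving the block 1111 (positions 8 to 11) in front of the block 000 yields w2,
# so the distance with moves between w and w2 is at most 1.
print(move(w, 8, 12, 5) == w2)  # True
```

With the classical edit distance, the same transformation would need several insertions and deletions.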


Uniform Statistics

W = 001010101110 of length n, with n-k+1 blocks of length k = 1/ε

$$\mathrm{u.stat}(W) = \frac{1}{n-k+1} \begin{pmatrix} n_1 \\ n_2 \\ \vdots \\ n_{2^k} \end{pmatrix}$$

where $n_1$ = # of "00...0", $n_2$ = # of "00...1", ..., $n_{2^k}$ = # of "11...1".

For k=2, n-k+1=11:

$$\mathrm{u.stat}(W) = \frac{1}{11} \begin{pmatrix} 1 \\ 4 \\ 4 \\ 2 \end{pmatrix}$$

$|\mathrm{u.stat}(W) - \mathrm{u.stat}(W')|$ approximates $\mathrm{dist}(W, W')$ for words of similar length.

Y(W): statistics on N samples, $Y(W) \approx \mathrm{u.stat}(W)$.

Distance between words:
• NP-complete
• Testable, O(1): sample N subwords of length k to compute Y(W) and Y(W'). If |Y(W)-Y(W')| < ε, accept, else reject.
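The uniform statistics of the example word can be recomputed exactly. The following is a minimal sketch with assumed helper names (`ustat`, `l1`); it uses exact fractions and the L1 norm, and the second word 000000111111 is only an illustrative choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def ustat(word, k, alphabet="01"):
    """Uniform statistics: frequency of each word of length k among the
    n-k+1 subwords of length k, in lexicographic order of the subwords."""
    total = len(word) - k + 1
    counts = Counter(word[i:i + k] for i in range(total))
    return {"".join(u): Fraction(counts["".join(u)], total)
            for u in product(alphabet, repeat=k)}


def l1(p, q):
    """L1 distance between two statistics over the same subwords."""
    return sum(abs(p[u] - q[u]) for u in p)


W = "001010101110"
for subword, freq in ustat(W, 2).items():
    print(subword, freq)   # 00 1/11, 01 4/11, 10 4/11, 11 2/11

print(l1(ustat(W, 2), ustat("000000111111", 2)))  # 14/11
```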


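3. Tester for a regular language

W: 0000000000111111111111
Y: 000001000011111101111
Z: 1111111111110000000000
T: 01001010001011000111010101

[Figure: automaton A with states a and b and transitions labeled 0, 1, 1; polytope H with extreme points 1000 and 0001, and the points u.stat(T), u.stat(Y), u.stat(W), u.stat(Z).]

$\mathrm{u.stat}(W) \approx \mathrm{u.stat}(Z) \approx \mathrm{u.stat}(Y) \approx (0.5 - \varepsilon/2,\ \varepsilon/2,\ \varepsilon/2,\ 0.5 - \varepsilon/2)$

$\mathrm{u.stat}(T) \approx (0.25,\ 0.25,\ 0.25,\ 0.25)$

Automaton A defines L, and a polytope H for u.stats.

Tester for W in L:
• Testable, O(1): compute Y(W).
• If dist(Y(W), H) < ε, accept, else reject.

Remark: robust to noise.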
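As a rough illustration of the distance to H, the sketch below computes the exact statistics for k = 2 of the four words above and their L1 distance to the segment joining (1,0,0,0) and (0,0,0,1). Two assumptions are made here and are not taken from the slides: L is read as 0*1*, and H is approximated by that segment, ignoring the O(1/n) contribution of the single subword 01. For this segment the L1 distance has the closed form 2.(y01 + y10).

```python
from collections import Counter
from fractions import Fraction


def ustat2(word):
    """Uniform statistics for k = 2, as a tuple ordered 00, 01, 10, 11."""
    total = len(word) - 1
    counts = Counter(word[i:i + 2] for i in range(total))
    return tuple(Fraction(counts[u], total) for u in ("00", "01", "10", "11"))


def dist_to_segment(y):
    """L1 distance from y to the segment between (1,0,0,0) and (0,0,0,1).

    Assumed stand-in for the polytope H of 0*1*: the mass on 01 and 10 must be
    removed and moved to 00 or 11, which costs twice that mass.
    """
    return 2 * (y[1] + y[2])


words = {
    "W": "0000000000111111111111",
    "Y": "000001000011111101111",
    "Z": "1111111111110000000000",
    "T": "01001010001011000111010101",
}
for name, word in words.items():
    print(name, float(dist_to_segment(ustat2(word))))
```

W and Z are at the same small distance from the segment although Z is not in 0*1*: one move turns Z into W, so both are close for the edit distance with moves. T is far, and Y sits in between. With k = 2 the statistics are coarse, so the values only indicate an ordering.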


Pair (A,H)

Blocks, k=2, m=4, |Σ|=4, |Σ|^k + 1 = 17:

Loops of size 1 block: {(aa, ca: 1), (bb: 2), (cc, ac: 3), (dd: 4)}

[Figure: automaton A with states 1, 2, 3, 4 and transitions over a, b, c, d; polytope H with points labeled aa, ca, ac, cc, bb, dd.]


Corrector of a regular language

W: 000001000011111101111 is ε-close to L(A)

Deterministic Correction:
1. Decomposition in admissible subwords: 000001000011111101111 gives 000001 000111111 1111
2. Decomposition in connected components: 000001 000 111111 1111
3. Recomposition (Moves): 000 000001 111111 1111, at distance 3 from W

[Figure: automaton A with states a and b and transitions labeled 0, 1, 1.]


Corrector of an ordered tree

2 moves, dist=2

Tree automaton or DTD: t: l,r   r: l,r


XML Corrector: http://www.lri.fr/~mdr/xml/


Applications

Testers:
• Estimate the distance between two XML files,
• Decide if an XML file F is ε-valid,
• Decide if two DTDs are close.

Correctors: if an XML file F is ε-close to a DTD,
• Find a valid F' ε-close to F;
• Rank XML files for a set of DTDs (supervised learning)

Program Verification:
• Decide if two automata are ε-close in polynomial time.
• Approximate Model-Checking: http://www.lri.fr/~mdr/vera/

• Specification language
• Model
• Distance


4. Approximate Data Exchange: typechecking

Data Exchange setting (K_S, τ, K_T), where τ is a transducer:

ε-Typechecking: decide if for all I in K_S, τ(I) is ε-close to K_T.

Words: is τ(K_S) ε-close to K_T? Apply the Equivalence Tester in polynomial time, as τ(K_S) is regular.

Trees: similar technique, exponential in |DTD|.

Open problem: is DTD ε-Equivalence in P?


Approximate Data Exchange: Source-Consistency

Data Exchange setting (K_S, τ, K_T), where τ is a transducer:

ε-Source-Consistency: given a source structure I, is there a source I' in K_S, ε-close to I, s.t. τ(I') is ε-close to K_T?

Words:

Case 1: Transducer with one state.
1. Sample I.
2. Take the image by τ: Y(τ(I)).
3. Statistics: test if Y(τ(I)) is ε-close to K_T.

Case 2: Transducer with many states. Distinguish between compatible paths.


Approximate Data Exchange: Source-Consistency

Data Exchange setting (K_S, τ, K_T), where τ is a transducer:

ε-Source-Consistency on trees:

Sampling in T provides statistics on τ(T). Apply the tester on trees.


5. Composition of close settings

Data Exchange settings: (K_S1, τ, K_T1), (K_S2, τ', K_T2).

Composition is possible when the schemas are ε-close.
• Apply a corrector at every stage to define the new τ'' for (K_S1, τ'', K_T2): apply the corrector C1 for K_T1 to τ(I) and obtain C1.τ(I) in K_T1, then apply the corrector C for K_S2, then τ', then the corrector C2 for K_T2:

τ'': C2 . τ' . C . C1 . τ (I)

[Figure: I mapped to τ(I).]
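The composed mapping can be pictured as a plain function composition. The sketch below is purely illustrative: the schemas (0*1*, a*b*, x*y*), the two letter-to-letter transducers and the sorting "corrector" are toy assumptions, not the correctors of the talk. Sorting the letters always returns a word of a*b*, but not necessarily the closest one.

```python
from typing import Callable

Stage = Callable[[str], str]


def compose(*stages: Stage) -> Stage:
    """compose(f, g, h)(x) == f(g(h(x))), the right-to-left order of C2 . tau' . C . C1 . tau."""
    def run(word: str) -> str:
        for stage in reversed(stages):
            word = stage(word)
        return word
    return run


def letter_transducer(rules: dict) -> Stage:
    """One-state transducer given by a letter-to-word table."""
    return lambda word: "".join(rules[letter] for letter in word)


def sorting_corrector(word: str) -> str:
    """Toy stand-in for a corrector: sorted letters always form a word of a*b* (or x*y*)."""
    return "".join(sorted(word))


tau = letter_transducer({"0": "a", "1": "b"})        # hypothetical tau, target a*b*
tau_prime = letter_transducer({"a": "x", "b": "y"})  # hypothetical tau', target x*y*
c1 = c = c2 = sorting_corrector

tau_second = compose(c2, tau_prime, c, c1, tau)
print(tau_second("0101"))  # xxyy
```

Each intermediate result is pushed back into the expected schema before the next stage runs, which is what makes the second setting applicable to the output of the first.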


Conclusion

1. Data Exchange: Source-Consistency, Typechecking, Composition.
2. Property Testing based Approximation
3. Tester and Corrector of regular languages
4. Equivalence tester for automata
• Polynomial time approximate algorithm (the exact problem is PSPACE-complete)
• Generalization to Büchi automata: approximate Model-Checking
• Context-Free Languages: exponential algorithm (the exact problem is undecidable)
5. Approximate Data Exchange
6. Connection to PAC-Learning


Application to learning

Model: take random words according to a distribution D:

U.stat representation:

Negative examples could include the distance.

Learning algorithm: convex hulls of positive examples.


PAC learning

The regular language is a polytope for u.stat.

Polytopes have a finite VC dimension. Hence they are PAC learnable.

Problem: the learnt concept may be ε-far from the language L.

For special distributions D, it may be ε-close. Example: D is uniform and the polytopes are « large ».


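Block and Uniform statistics

W = 001010101110 of length n, k = 1/ε
b.stat: consecutive subwords of length k, n/k blocks
u.stat: any subwords of length k, n-k+1 blocks

$$\mathrm{b.stat}(W) = \frac{1}{n/k} \begin{pmatrix} n_1 \\ n_2 \\ \vdots \\ n_{2^k} \end{pmatrix}$$

where $n_1$ = # of "00...0", $n_2$ = # of "00...1", ..., $n_{2^k}$ = # of "11...1".

For k=2, n/k=6:

$$\mathrm{b.stat}(W) = \frac{1}{6} \begin{pmatrix} 1 \\ 0 \\ 4 \\ 1 \end{pmatrix}, \qquad \mathrm{u.stat}(W) = \frac{1}{11} \begin{pmatrix} 1 \\ 4 \\ 4 \\ 2 \end{pmatrix}$$

Main study: $\|\mathrm{u.stat}(W) - \mathrm{u.stat}(W')\|_1$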
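Both statistics of the example word can be checked side by side. This is a small sketch with assumed function names (`stat`, `bstat`, `ustat`); it supposes that the length of the word is a multiple of k for the block statistics.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def stat(subwords, k, alphabet="01"):
    """Frequencies of all words of length k, in lexicographic order."""
    subwords = list(subwords)
    counts = Counter(subwords)
    return tuple(Fraction(counts["".join(u)], len(subwords))
                 for u in product(alphabet, repeat=k))


def bstat(word, k):
    """Block statistics: the n/k consecutive, non-overlapping blocks of length k."""
    return stat((word[i:i + k] for i in range(0, len(word) - k + 1, k)), k)


def ustat(word, k):
    """Uniform statistics: all n-k+1 subwords of length k."""
    return stat((word[i:i + k] for i in range(len(word) - k + 1)), k)


W = "001010101110"
print(bstat(W, 2))  # frequencies (1, 0, 4, 1) / 6
print(ustat(W, 2))  # frequencies (1, 4, 4, 2) / 11
```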


Tester for equality of strings

Edit distance with moves. NP-complete problem, but approximable in constant time with additive error.

Uniform statistics (k = 1/ε): W = 001010101110, $\mathrm{u.stat}(W) = \frac{1}{11}(1, 4, 4, 2)$

Theorem 1. |u.stat(w)-u.stat(w')| approximates dist(w,w').

Sample N subwords of length k, compute Y(w) and Y(w'):

$$Y(w) = \frac{1}{N} \sum_{i=1,\ldots,N} X_i, \qquad Y(w') = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$$

where $X_i = (0, \ldots, 0, 1, 0)$ is the indicator vector of the i-th sampled subword (taken in w for Y(w), in w' for Y(w')).

Lemma (Chernoff). Y(w) approximates u.stat(w).

Corollary. |Y(w)-Y(w')| approximates dist(w,w').

Tester: if |Y(w)-Y(w')| < ε, accept, else reject.
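The sampling tester can be sketched as follows. Several choices here are assumptions rather than the authors' algorithm: the number of samples is passed as a parameter `n_samples` instead of being derived from ε, k is taken as round(1/ε), and the same random draws are reused for both words so that two identical words always get identical statistics.

```python
import random
from collections import Counter


def sample_stat(word, k, draws):
    """Y(word): empirical frequencies of the subwords of length k read at the
    positions selected by draws, a list of numbers in [0, 1)."""
    positions = len(word) - k + 1
    counts = Counter()
    for r in draws:
        i = min(int(r * positions), positions - 1)
        counts[word[i:i + k]] += 1
    return {u: c / len(draws) for u, c in counts.items()}


def l1(p, q):
    """L1 distance between two sparse frequency tables."""
    return sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in set(p) | set(q))


def equality_tester(w, w2, eps, n_samples, rng):
    """Accept iff |Y(w) - Y(w2)| < eps, reading n_samples subwords per word."""
    k = max(1, round(1 / eps))
    draws = [rng.random() for _ in range(n_samples)]
    return l1(sample_stat(w, k, draws), sample_stat(w2, k, draws)) < eps


rng = random.Random(1)
print(equality_tester("01" * 500, "01" * 500, 0.25, 200, rng))  # True
print(equality_tester("0" * 1000, "1" * 1000, 0.25, 200, rng))  # False
```

The number of letters read is n_samples times k for each word, whatever the length of the words.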


Soundness and Robustness

Let F be a property on strings.

Soundness: ε-close strings, $\mathrm{dist}(w,w') \le \varepsilon.n$, have close statistics

Robustness: ε-far strings, $\mathrm{dist}(w,w') \ge \varepsilon.n$, have far statistics

F is Equality on pairs of strings. For Theorem 1, we prove:

1. b.stat is robust
2. u.stat is sound
3. u.stat is robust


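Robustness of b.stat

Robustness of b.stat: $\mathrm{dist}(w,w') \le (\tfrac{1}{2}.|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| + \varepsilon).n$

If $\mathrm{b.stat}(w) = \mathrm{b.stat}(w')$ then $\mathrm{dist}(w,w') \le \varepsilon.n$.

If $\mathrm{b.stat}(w) \ne \mathrm{b.stat}(w')$ then construct w'' s.t. $\mathrm{b.stat}(w'') = \mathrm{b.stat}(w')$, after at most $\tfrac{1}{2}.|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')|.n$ substitutions on w.

Example: $\mathrm{b.stat}(W) = \tfrac{1}{6}(1, 0, 4, 1)$ and $\mathrm{b.stat}(W') = \tfrac{1}{6}(2, 0, 3, 1)$: #"00" is 1 in W and 2 in W', but #"10" is 4 in W and 3 in W'.

W'': take one block "10" in W and change it into "00".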

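The construction of W'' can be sketched as a greedy block rewriting. The word W' = 000010101110 used below is an assumed example whose block statistics are (2, 0, 3, 1)/6; the slides only give those statistics, not the word. The function name `match_bstat` is hypothetical, and both words are supposed to have the same length, a multiple of k.

```python
from collections import Counter


def blocks(word, k):
    """The consecutive, non-overlapping blocks of length k of word."""
    return [word[i:i + k] for i in range(0, len(word) - k + 1, k)]


def match_bstat(w, w_prime, k):
    """Rewrite blocks of w until its block counts equal those of w_prime.

    Returns the new word and the number of rewritten blocks, which is half
    the L1 distance between the two block-count vectors.
    """
    have, want = Counter(blocks(w, k)), Counter(blocks(w_prime, k))
    surplus = have - want                      # block types w has too many of
    missing = list((want - have).elements())   # block types w lacks
    out, changed = [], 0
    for block in blocks(w, k):
        if surplus[block] > 0 and missing:
            surplus[block] -= 1
            out.append(missing.pop())
            changed += 1
        else:
            out.append(block)
    return "".join(out), changed


W = "001010101110"
W_prime = "000010101110"   # assumed word with block counts 00:2, 10:3, 11:1
W_second, changed = match_bstat(W, W_prime, 2)
print(W_second, changed)   # 000010101110 1
```

Once the block statistics agree, the blocks only need to be reordered with moves, which is where the additional ε.n term of the robustness bound comes from.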

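Soundness of u.stat

Soundness of u.stat: $\mathrm{dist}(w,w') \le \varepsilon^2.n \Rightarrow |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6.\varepsilon$

Simple edit: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2k}{n-k+1} \approx \frac{2}{\varepsilon.n}$

Move w=A.B.C.D, w'=A.C.B.D: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2.3(k-1)}{n-k+1} \le \frac{6}{\varepsilon.n}$

Hence, for $\varepsilon^2.n$ operations, $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6.\varepsilon$

Remark: b.stat is not sound.

Problem: robustness of u.stat? Harder! We need an auxiliary distribution and two key lemmas.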
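A small numerical check can illustrate the bound for one move and the remark that b.stat is not sound. The two pairs of words are illustrative assumptions: the first pair is the move example of the edit distance slide, and the second pair, (01)^6 and (10)^6, differs by a single move of one letter.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def stat(subwords, k):
    """Frequencies of all binary words of length k, in lexicographic order."""
    subwords = list(subwords)
    counts = Counter(subwords)
    return tuple(Fraction(counts["".join(u)], len(subwords))
                 for u in product("01", repeat=k))


def ustat(word, k):
    return stat((word[i:i + k] for i in range(len(word) - k + 1)), k)


def bstat(word, k):
    return stat((word[i:i + k] for i in range(0, len(word) - k + 1, k)), k)


def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


k = 2

# One move: the change of u.stat stays below 2.3(k-1)/(n-k+1).
w, w2 = "0111000011110011001", "0111011110000011001"
bound = Fraction(2 * 3 * (k - 1), len(w) - k + 1)
print(l1(ustat(w, k), ustat(w2, k)) <= bound)   # True

# One move of the first letter to the end: b.stat jumps to the maximal
# L1 distance 2, while u.stat barely changes.
v, v2 = "01" * 6, "10" * 6
print(l1(bstat(v, k), bstat(v2, k)))            # 2
print(l1(ustat(v, k), ustat(v2, k)))            # 2/11
```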


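Statistics on words

Block statistics: b.stat
Uniform statistics: u.stat
Block Uniform statistics: bu.stat

[Figure: a word cut into blocks of length k (b.stat), its subwords of length k (u.stat), and blocks of length K with an offset t, inside which subwords v_1, ..., v_i, ... are cut into blocks of length k (bu.stat); k = 1/ε.]

$X_i = \mathrm{b.stat}(v_i)$, for $i = 1, \ldots, n/K$

$\mathrm{bu.stat}(w) = E(\mathrm{b.stat}(v))$, the expectation being taken over the offsets t and the subwords $v_i$, $i = 1, \ldots, n/K$

K is fixed as a function of k and of a constant c.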


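Uniform Statistics

Number of subwords of length k missed by bu.stat: $(k-1).(n/K - 1) = |A - B|$

Lemma: let $B \subseteq A$ and let $\mu_A$, $\mu_B$ be the uniform distributions on A and B. Then $|\mu_A - \mu_B| \le 2.|A - B| / |B|$.

Apply the previous lemma with A the set of the n-k+1 subwords of length k of w, and B the subset of those counted by bu.stat. This bounds $|\mathrm{u.stat}(w) - \mathrm{bu.stat}(w)|$.

Lemma 2: for every w, bu.stat(w) and u.stat(w) are close in L1 norm, with a bound that follows from the previous lemma and depends on ε, K and n.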


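Block Uniform Statistics

$\mathrm{bu.stat}(w) = E(\mathrm{b.stat}(v))$, the expectation being taken over the offsets t and the subwords $v_i$, $i = 1, \ldots, n/K$

$X_i = \mathrm{b.stat}(v_i)$, $X_i[u] = \mathrm{b.stat}(v_i)[u]$, $0 \le X_i[u] \le 1$

Each $X_i[u]$ is independent. The average on i is $\mathrm{bu.stat}(w)[u]$.

Chernoff Bound: $\Pr[\,|\mathrm{b.stat}(v)[u] - \mathrm{bu.stat}(w)[u]| \ge t.\mathrm{bu.stat}(w)[u]\,]$ is exponentially small in n/K.

Union Bound: $\Pr[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \ge t.\mathrm{bu.stat}(w)\,]$ is at most $2^k$ times the Chernoff bound.

For n large enough and a suitable t: $\Pr[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \le \varepsilon/2\,] > 0$

Lemma 1: $\forall w\ \exists v:\ |\mathrm{bu.stat}(w) - \mathrm{b.stat}(v)| \le \varepsilon/2$, with dist(v,w) bounded by a term involving c.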


Robustness of the uniform Statistics

Robustness of u.stat: $\mathrm{dist}(w,w') \ge 5.\varepsilon.n \Rightarrow |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \ge 6.5.\varepsilon$

By Lemma 1: $\forall w\ \exists v:\ |\mathrm{bu.stat}(w) - \mathrm{b.stat}(v)| \le \varepsilon/2$

By Lemma 2: for every w, bu.stat(w) and u.stat(w) are close.

Get v, v' close to w, w'.

Robustness of b.stat implies robustness of u.stat.

Robustness of b.stat: $(\tfrac{1}{2}.|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| + \varepsilon).n \ge \mathrm{dist}(w,w') \ge 5.\varepsilon.n$

Tolerant tester: take N in O(c) samples and accept if $|Y(w) - Y(w')| \le 5.\varepsilon$

Theorem: for two words w and w' large enough, the tester:
1. Accepts if w=w' with probability 1
2. Accepts if w,w' are ε²-close with probability 2/3
3. Rejects if w,w' are ε-far with probability 2/3


Membership and Equivalence tester

Membership Tester for w in L (regular):
1. Construction of the tester: precompute H_ε.
2. Tester: compute Y(w) (approx. b.stat(w)). Accept iff Y(w) is at distance less than ε to H_ε.

Construction: time is $m^{|\Sigma|^{O(k)}}$.
Tester: query complexity in $O(k)$, time complexity in $2^{|\Sigma|^{O(k)}}$.

Remark 1: Time complexity of previous testers was exponential in m.
Remark 2: The same method works for L context-free.

Tester of the approximate equivalence of A and B:
1. Compute H_{ε,A} and H_{ε,B}.
2. Reject if H_{ε,A} and H_{ε,B} are different.

Time polynomial in m = Max(|A|, |B|): $m^{|\Sigma|^{O(k)}}$.


Generalizations

Büchi Automata. Distance on infinite words: two words w and w' are ε-close if

$$\limsup_{n \to \infty} \frac{\mathrm{dist}(w(n), w'(n))}{n} \le \varepsilon$$

A word w is ε-close to a language L if there exists w' in L s.t. w and w' are ε-close.

Statistics: set of accumulation points of $\mathrm{b.stat}(w(n))$

H: compatible loops of connected components of accepting states

Tester for Büchi Automata: compute H_A and H_B. Reject if H_A and H_B are different.

Equivalence of CF grammars is undecidable; approximate equivalence is decidable in exponential time.