Grammatical inference vs grammar induction. London, 21-22 June 2007. Colin de la Higuera

Page 1

Grammatical inference vs grammar induction

London 21-22 June 2007

Colin de la Higuera

Page 2

Summary

1. Why study the algorithms and not the grammars
2. Learning in the exact setting
3. Learning in a probabilistic setting

Page 3

1 Why study the process and not the result?

The usual approach in grammatical inference is to build a grammar (or automaton), small and in some way adapted to the data from which we are supposed to learn.

Page 4

Grammatical inference

Is about learning a grammar given information about a language.

Page 5

Grammar induction

Is about learning a grammar given information about a language.

Page 6

Difference?

[Diagram: Data → G. The same picture is labelled both "Grammar induction" (the focus is on the resulting grammar G) and "Grammatical inference" (the focus is on the process).]

Page 7

Motivating* example #1

Is 17 a random number? Is 17 more random than 25? Suppose I had a random number generator: would I convince you by showing how well it does on an example? On various examples?

*(and only slightly provocative)

Page 8

Motivating example #2

Is 01101101101101010110001111 a random sequence?

What about aaabaaabababaabbba?

Page 9

Motivating example #3

Let X be a sample of strings. Is grammar G the correct grammar for sample X? Or is it G'? Here "correct" means something like "the one we should learn".

Page 10

Back to the definition

Grammar induction and grammatical inference are about finding a/the grammar from some information about the language.

But once we have done that, what can we say?

Page 11

What would we like to say?

That the grammar is the smallest, or the best with respect to some score: a combinatorial characterisation.

What we really want to say is that, having solved some complex combinatorial question, we have an Occam/compression/MDL/Kolmogorov-style argument proving that what we have found is of interest.

Page 12

What else might we like to say?

That in the near future, given some string, we can predict whether this string belongs to the language or not.

It would be nice to be able to bet £100 on this.

Page 13

What else would we like to say?

That if the solution we have returned is not good, then that is because the initial data was bad (insufficient, biased).

Idea: blame the data, not the algorithm.

Page 14

Suppose we cannot say anything of the sort?

Then that means that we may be terribly wrong even in a favourable setting.

Page 15

Motivating example #4

Suppose we have an algorithm that 'learns' a grammar by iteratively applying the following two operations:
- merge two non-terminals whenever some nice MDL-like rule holds
- add a new non-terminal and a rule corresponding to a substring, when needed

Page 16

Two learning operators: creation of non-terminals and rules

NP → ART ADJ NOUN
NP → ART ADJ ADJ NOUN

becomes

NP → ART AP1
NP → ART ADJ AP1
AP1 → ADJ NOUN

Page 17

Merging two non terminals

NP → ART AP1
NP → ART AP2
AP1 → ADJ NOUN
AP2 → ADJ AP1

becomes (merging AP2 into AP1)

NP → ART AP1
AP1 → ADJ NOUN
AP1 → ADJ AP1
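A minimal sketch of these two operators, assuming a grammar is represented as a set of (left-hand side, right-hand-side tuple) productions; the names and the representation are illustrative, not from the talk:

```python
def create_nonterminal(rules, substring, new_nt):
    """Introduce new_nt -> substring and rewrite every occurrence of
    substring inside a right-hand side as new_nt."""
    k = len(substring)
    def rewrite(rhs):
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + k]) == tuple(substring):
                out.append(new_nt)
                i += k
            else:
                out.append(rhs[i])
                i += 1
        return tuple(out)
    new_rules = {(lhs, rewrite(rhs)) for lhs, rhs in rules}
    new_rules.add((new_nt, tuple(substring)))
    return new_rules

def merge(rules, keep, drop):
    """Merge non-terminal `drop` into `keep`, e.g. when an MDL-like score improves."""
    ren = lambda s: keep if s == drop else s
    return {(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in rules}

# The slides' example:
rules = {("NP", ("ART", "ADJ", "NOUN")), ("NP", ("ART", "ADJ", "ADJ", "NOUN"))}
rules = create_nonterminal(rules, ("ADJ", "NOUN"), "AP1")
# -> NP -> ART AP1 ; NP -> ART ADJ AP1 ; AP1 -> ADJ NOUN
```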

Page 18

What is bound to happen?

We will learn a context-free grammar that can only generate a regular language.

Brackets are not found.

This is a hidden bias.

Page 19

But how do we say that a learning algorithm is good?

By accepting the existence of a target.

The question is that of studying the process of finding this target (or something close to this target). This is an inference process.

Page 20

If you don’t believe there is a target?

Or that the target belongs to another class?

Then you will have to come up with another bias: for example, believing that simplicity (e.g. MDL) is the right way to handle the question.

Page 21

If you are prepared to accept there is a target but..

Either the target is known, and then what is the point of learning?

Or we don't know it in the practical case (with this data set), and then it is of no use…

Page 22

Then you are doing grammar induction.

Page 23

Careful

Some statements that are dangerous:
- Algorithm A can learn {aⁿbⁿcⁿ : n ∈ ℕ}
- Algorithm B can learn this rule with just 2 examples

This looks to me close to wanting a free lunch.

Page 24

A compromise

You only need to believe there is a target while evaluating the algorithm.

Then, in practice, there may not be one!

Page 25

End of provocative example

If I run my random number generator and get 999999, I can only keep this number if I believe in the generator itself.

Page 26

Credo (1)

Grammatical inference is about measuring the convergence of a grammar learning algorithm in a typical situation.

Page 27

Credo (2)

Typical can be:
- in the limit: learning is always achieved, one day
- probabilistic:
  - there is a distribution to be used (errors are measurably small)
  - there is a distribution to be found

Page 28

Credo (3)

Complexity theory should be used: the total or update runtime, the size of the data needed, the number of mind changes, the number and weight of errors…

…should be measured and limited.

Page 29

2 Non-probabilistic setting

- Identification in the limit
- Resource-bounded identification in the limit
- Active learning (query learning)

Page 30

Identification in the limit

- The definitions, presentations
- The alternatives:
  - order-free or not
  - randomised algorithms

Page 31

A presentation is

a function f: ℕ → X, where X is any set.

yields: Presentations → Languages

If f(ℕ) = g(ℕ) then yields(f) = yields(g).

Page 32

Learning function

Given a presentation f, f_n is the set of the first n elements of f.

A learning algorithm a is a function that takes as input a set f_n = {f(0), …, f(n-1)} and returns a grammar.

Given a grammar G, L(G) is the language generated/recognised/represented by G.

Page 33

Identification in the limit

L: a class of languages. G: a class of grammars. Pres ⊆ (ℕ → X): the presentations. yields: the naming function from presentations to languages, with f(ℕ) = g(ℕ) ⇒ yields(f) = yields(g). a: a learner.

The learner a identifies L in the limit if, for every presentation f:

∃n ∈ ℕ : ∀k > n, L(a(f_k)) = yields(f)
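As a toy illustration of this definition (not from the talk): the class of finite languages is identifiable in the limit from text by the learner that conjectures exactly the set of strings seen so far.

```python
def learner(prefix):
    """a(f_n): conjecture the finite language equal to the sample itself."""
    return frozenset(prefix)

def presentation(language):
    """A presentation of a finite language: cycle through its strings forever."""
    ordered = sorted(language)
    i = 0
    while True:
        yield ordered[i % len(ordered)]
        i += 1

target = {"a", "ab", "abb"}
f = presentation(target)
prefix, hypotheses = [], []
for _ in range(6):
    prefix.append(next(f))
    hypotheses.append(learner(prefix))

# Once every string of the target has appeared, the hypothesis never changes:
assert hypotheses[-1] == frozenset(target)
```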

Page 34

What about efficiency?

We can try to bound:
- global time
- update time
- errors before converging
- mind changes
- queries
- good examples needed

Page 35

What should we try to measure?

The size of G? The size of L? The size of f? The size of f_n?

Page 36

Some candidates for polynomial learning

- Total runtime polynomial in ║L║
- Update runtime polynomial in ║L║
- Number of mind changes polynomial in ║L║
- Number of implicit prediction errors polynomial in ║L║
- Size of characteristic sample polynomial in ║L║

Page 37

[Diagram: the learner a maps each prefix of the presentation to a hypothesis: f_1 ↦ G_1, f_2 ↦ G_2, …, f_n ↦ G_n; for every later prefix f_k the learner still returns G_n, i.e. it has converged.]

Page 38

Some selected results (1)

DFA           text   informant
Runtime       no     no
Update-time   no     yes
#IPE          no     no
#MC           no     ?
CS            no     yes

Page 39

Some selected results (2)

CFG           text   informant
Runtime       no     no
Update-time   no     yes
#IPE          no     no
#MC           no     ?
CS            no     no

Page 40

Some selected results (3)

Good balls    text   informant
Runtime       no     no
Update-time   yes    yes
#IPE          yes    no
#MC           yes    no
CS            yes    yes

Page 41

3 Probabilistic setting

- Using the distribution to measure error
- Identifying the distribution
- Approximating the distribution

Page 42

Probabilistic settings

- PAC learning
- Identification with probability 1
- PAC learning distributions

Page 43

Learning a language from sampling

We have a distribution over Σ*. We sample twice:
- once to learn
- once to see how well we have learned

The PAC setting: Probably Approximately Correct.
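A minimal sketch of this two-sample protocol (everything here, the distribution, target and learner, is made up for illustration):

```python
import random

# Unknown distribution D over {a, b}*; the target language is (ab)*.
def draw(m):
    return ["ab" * random.randint(0, 5) if random.random() < 0.8
            else "a" * random.randint(1, 5) for _ in range(m)]

def in_target(w):
    return w == "ab" * (len(w) // 2)

def learn(sample):
    """Toy learner: memorise the positive examples of a labelled sample."""
    positives = {w for w in sample if in_target(w)}
    return lambda w: w in positives

train, test = draw(200), draw(1000)   # sample once to learn...
h = learn(train)                      # ...and once, independently, to evaluate
error = sum(h(w) != in_target(w) for w in test) / len(test)
print(f"empirical error under D: {error:.3f}")
```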

Page 44

PAC learning (Valiant 84, Pitt 89)

- L a set of languages
- G a set of grammars
- ε > 0 and δ > 0
- m a maximal length over the strings
- n a maximal size of grammars

Page 45

Polynomially PAC learnable

There is an algorithm that samples reasonably and returns, with probability at least 1-δ, a grammar that will make at most ε errors.
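In symbols, a standard rendering of this criterion (the ε and δ were lost in extraction, so this reconstruction is hedged):

```latex
\Pr\Bigl[\ \Pr_{x \sim D}\bigl[H(x) \neq G(x)\bigr] < \epsilon\ \Bigr] \;\ge\; 1 - \delta,
\qquad\text{with sample size and runtime polynomial in } \tfrac{1}{\epsilon},\ \tfrac{1}{\delta},\ n,\ m.
```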

Page 46

Results

Under cryptographic assumptions, we cannot PAC-learn DFA.

We cannot PAC-learn NFA or CFGs with membership queries either.

Page 47

Learning distributions

- No error
- Small error

Page 48

No error

This calls for identification in the limit with probability 1.

It means that the probability of not converging is 0.

Page 49

Results

If probabilities are computable, we can learn finite state automata with probability 1.

But not with bounded (polynomial) resources.

Page 50

With error

PAC definition, but the error should be measured by a distance between the target distribution and the hypothesis:

L1, L2, L∞?

Page 51

Results

Too easy with L∞. Too hard with L1.

Nice algorithms for biased classes of distributions.

Page 52

For those who are not convinced there is a difference

Page 53

Structural completeness

Given a sample and a DFA:
- each edge is used at least once
- each final state accepts at least one string

Look only at DFAs for which the sample is structurally complete!
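A small sketch of this check, assuming the DFA is given as a start state, a transition map delta, and a set of final states (the representation and names are mine, and the sample is assumed to be accepted by the DFA):

```python
def structurally_complete(sample, start, delta, finals):
    """True iff the sample uses every transition of the DFA and
    ends in every final state at least once."""
    used_edges, used_finals = set(), set()
    for w in sample:
        q = start
        for c in w:
            used_edges.add((q, c))
            q = delta[(q, c)]       # assumes every needed transition exists
        if q in finals:
            used_finals.add(q)
    return used_edges == set(delta) and used_finals == set(finals)
```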

Page 54

X+ = {aab, b, aaaba, bbaba} is not structurally complete; add … and abba.

[Figure: a DFA over {a, b} for which the sample above is not structurally complete.]

Page 55

Question

Why is the automaton structurally complete for the sample?

And not the sample structurally complete for the automaton?

Page 56

Some of the many things I have not talked about

Grammatical inference is about new algorithms

Grammatical inference is applied to various fields: pattern recognition, machine translation, computational biology, NLP, software engineering, web mining, robotics…

Page 57

And

Next ICGI in Brittany in 2008.

Some references are in the one-page abstract, others on the grammatical inference webpage.

Page 58

Appendix, some technicalities

[Overview diagram: size of G, size of L, size of f; runtimes, #MC, #IPE, #CS, PAC.]

Page 59

The size of L

If no grammar system is given: meaningless.

If G is the class of grammars, then ║L║ = min{║G║ : G ∈ G, L(G) = L}.

Example: the size of a regular language, when considering DFA, is the number of states of the minimal DFA that recognises it.

Page 60

Is a grammar representation reasonable?

Difficult question: typical arguments are that NFAs are better than DFAs because you can encode more languages with fewer bits.

Yet redundancy is necessary!

Page 61

Proposal

A grammar class is reasonable if it encodes sufficiently many different languages.

I.e. with n bits you have 2^(n+1) encodings, so optimally you should have 2^(n+1) different languages.

Allow for redundancy and syntactic sugar, so p(2^(n+1)) different languages.

Page 62

But

We should allow for redundancy and for some strings that do not encode grammars.

Therefore a grammar representation is reasonable if there exists a polynomial p() such that, for any n, the number of different languages encoded by grammars of size n is at least p(2^n).

Page 63

The size of a presentation f

Meaningless. Or at least no convincing definition comes up.

But when associated with a learner a we can define the convergence point Cp(f, a): the point from which the learner a has found a grammar for the correct language L and does not change its mind.

Cp(f, a) = min{n : ∀m ≥ n, a(f_m) = a(f_n) and L(a(f_n)) = L}
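On a finite prefix of a presentation the convergence point can only be observed, not proved; this illustrative helper (names are mine) returns the first index from which the hypothesis is correct and stable within the prefix:

```python
def convergence_point(learner, prefix, is_correct):
    """Cp(f, a) as visible on a finite prefix; None if not yet converged."""
    hyps = [learner(prefix[:i + 1]) for i in range(len(prefix))]
    for n, h in enumerate(hyps):
        if is_correct(h) and all(later == h for later in hyps[n:]):
            return n
    return None
```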

Page 64

The size of a finite presentation f_n

An easy attempt is n. But this does not represent the quantity of information we have received to learn.

A better measure is Σ_{i<n} |f(i)|.
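The two candidate measures side by side (illustrative):

```python
def size_naive(prefix):        # "an easy attempt is n"
    return len(prefix)

def size_information(prefix):  # sum of |f(i)| over the prefix
    return sum(len(x) for x in prefix)
```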

Page 65

Quantities associated with learner a

The update runtime: the time needed to update hypothesis h_(n-1) into h_n when presented with f(n).

The complete runtime: the time needed to build hypothesis h_n from f_n; equivalently, the sum of all update runtimes.

Page 66

Definition 1 (total time)

G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), Cp(f, a) ≤ p(║G║)

(or global-runtime(a) ≤ p(║G║)).

Page 67

Impossible

Just take some presentation that stays useless until the bound is reached and then starts helping.

Page 68

Definition 2 (update polynomial time)

G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), update-runtime(a) ≤ p(║G║).

Page 69

Doesn’t work

We can just defer identification: here we are only measuring the time it takes to build the next hypothesis.

Page 70

Definition 4: polynomial number of mind changes

G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G),

#{i : a(f_i) ≠ a(f_(i+1))} ≤ p(║G║).
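Counting mind changes along a prefix, in the same illustrative style as the helper above:

```python
def mind_changes(learner, prefix):
    hyps = [learner(prefix[:i + 1]) for i in range(len(prefix))]
    return sum(1 for h1, h2 in zip(hyps, hyps[1:]) if h1 != h2)
```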

Page 71

Definition 5: polynomial number of implicit prediction errors

Say that G errs on x, for an element x of the presentation, if G is incorrect with respect to x (i.e. the algorithm producing G has made an implicit prediction error).

Page 72

G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G),

#{i : a(f_i) errs on f(i+1)} ≤ p(║G║).

Page 73

Definition 6: polynomial characteristic sample

G has polynomial characteristic samples for identification algorithm a if there exists a polynomial p() such that, given any G in G, there is a correct sample Y for G such that whenever Y ⊆ f_n we have L(a(f_n)) = L(G), with ║Y║ ≤ p(║G║).

Page 74

3 Probabilistic setting

- Using the distribution to measure error
- Identifying the distribution
- Approximating the distribution

Page 75

Probabilistic settings

- PAC learning
- Identification with probability 1
- PAC learning distributions

Page 76

Learning a language from sampling

We have a distribution over Σ*. We sample twice:
- once to learn
- once to see how well we have learned

The PAC setting.

Page 77

How do we consider a finite set?

[Diagram: Σ* split into D≤m, the strings of length at most m, and the tail of longer strings, whose probability we want < ε.]

By sampling (1/ε) · ln(1/δ) examples we can find a safe m.
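A sketch of that trick (the function drawing one string from D is a placeholder): draw N = ⌈(1/ε) ln(1/δ)⌉ strings and keep the longest length seen; with probability at least 1-δ, the tail beyond that length has mass below ε.

```python
import math

def safe_length(draw_one, eps, delta):
    n = math.ceil((1 / eps) * math.log(1 / delta))
    return max(len(draw_one()) for _ in range(n))
```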

Page 78

PAC learning (Valiant 84, Pitt 89)

- L a set of languages
- G a set of grammars
- ε > 0 and δ > 0
- m a maximal length over the strings
- n a maximal size of grammars

Page 79

H is ε-AC (approximately correct)* if

Pr_D[H(x) ≠ G(x)] < ε

Page 80

[Diagram: the languages L(G) and L(H), with the errors in their symmetric difference.]

Errors: we want L1(D(G), D(H)) < ε.

Page 81

[Figure: a four-state probabilistic automaton with transitions labelled a and b, each carrying a fractional probability.]

Page 82

[Figure: the same automaton, annotated with the computation of Pr(abab) as the product of the probabilities along the path reading abab.]
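The figures did not survive extraction, but the computation they illustrate is just a product of transition probabilities. A sketch with made-up numbers (each state's outgoing probabilities plus its final probability sum to 1):

```python
# (state, symbol) -> (next_state, probability)
delta = {(1, "a"): (2, 0.8), (2, "b"): (3, 1.0),
         (3, "a"): (4, 1.0), (4, "b"): (1, 1.0)}
final = {1: 0.2}  # probability of stopping in each state

def probability(w, start=1):
    p, q = 1.0, start
    for c in w:
        if (q, c) not in delta:
            return 0.0
        q, step = delta[(q, c)]
        p *= step
    return p * final.get(q, 0.0)

print(probability("abab"))  # 0.8 * 1.0 * 1.0 * 1.0 * 0.2 = 0.16
```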

Page 83

[Figure: a probabilistic automaton with edges labelled a and b and probabilities 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3, 0.7.]

Page 84

[Figure: another four-state probabilistic automaton with edges labelled a and b and fractional probabilities.]

Page 85

[Figure: a variant of the previous automaton.]