
Grammatical Inference: Learning Automata and Grammars

Colin de la Higuera, 2016

Kyoto, 11th April 2016

I am

• Professor at Nantes University
• Former President of the Société informatique de France
• Researcher in Machine Learning
• Guest scholar at Akutsu Lab, Kyoto University (Jan-Jun 2016)

Starting point: Machine Learning

Colin de la Higuera, Nantes, 2016 3

What is machine learning? (1)

• What does a universal Turing machine do?
  – It takes the data and the code and runs the code on the data
  – The code is therefore also data
• Next step (as proposed by Turing in 1948):
  – The learning machine
  – Takes the code and the data and returns the code, transformed

Alan Turing, On Computable Numbers, with an Application to the Entscheidungsproblem, Proc. London Math. Soc., 2nd series, vol. 42, 1937, pp. 230-265
Alan Turing, Computing Machinery and Intelligence, Mind, Oxford University Press, vol. 59, no. 236, 1950

Alan Turing, 1912-1954

Turing’s dream

What is machine learning? (2)

• Let the data decide (not the algorithm)
• The algorithm can be used to organise, index, search
• Not more
• Typical (or extreme?) application of this idea: k-nearest neighbours
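The k-nearest-neighbours idea above, where the stored data itself does the deciding, can be sketched in a few lines; the points, labels and query are made up for illustration:

```python
from collections import Counter

# A minimal sketch of k-nearest neighbours on 2-D points: the "model"
# is just the stored data, and the data decides the label of a query.
DATA = [((0.0, 0.0), "red"), ((1.0, 0.0), "red"),
        ((5.0, 5.0), "blue"), ((6.0, 5.0), "blue"), ((5.0, 6.0), "blue")]

def knn(query, k=3):
    """Label a query point by majority vote among its k nearest points."""
    dist2 = lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    nearest = sorted(DATA, key=lambda pair: dist2(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn((5.5, 5.5)))  # blue
```

Note that nothing is "learned" here in the model-building sense: the algorithm only organises and searches the data, exactly as the slide says.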

The big data project

Image: https://commons.wikimedia.org/wiki/File:Big_Bang_Data_exhibit_at_CCCB_17.JPG

What is machine learning? (3)

• Use the data to build a model:
  – A model is a way to
    • Compress the data
    • Interpret the data
    • Forget the data

The pragmatic approach

My talk at KU-ICR on Thursday

How do we choose the model?

• This depends on the data
• When the data is made of points in a 2-dimensional space, this is easy
• The model can be a half-plane, a line, a polynomial


How do we choose the model?

• This depends on the data
• When the data is made of points in a 3-dimensional space, this is (still) easy
• The model can be a hyperplane, a separating plane, a polynomial

How do we choose the model?

• This depends on the data
• When the data is made of points in a high-dimensional space, this is still possible (with linear algebra)
• The model can be a hyperplane, a separating hyperplane, a function

Image: https://i.ytimg.com/vi/Kk6rd4_dAqA/maxresdefault.jpg
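As a minimal sketch of the 2-D case above, here is a perceptron learning a separating line; the toy data set is made up and linearly separable, and the parameters are illustrative, not from the talk:

```python
# A perceptron learning a separating line in the plane; a minimal
# sketch with made-up, linearly separable toy data.
DATA = [((0.0, 1.0), -1), ((1.0, 2.0), -1), ((3.0, 0.0), 1), ((4.0, 1.0), 1)]

def train(epochs=50, lr=1.0):
    """Perceptron rule: nudge (w, b) towards each misclassified point."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in DATA:
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

w, b = train()

def sign(p):
    """Which side of the learned line w·p + b = 0 is p on?"""
    return 1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1

print(sign((3.5, 0.5)), sign((0.5, 1.5)))  # 1 -1
```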

Typical techniques today

• Support vector machines
• (deep) Neural networks

Images: https://upload.wikimedia.org/wikipedia/commons/1/10/Svm_10_perceptron.JPG
https://upload.wikimedia.org/wikipedia/commons/3/32/Single-layer_feedforward_artificial_neural_network.png

A research program for computer scientists

• We know how to manipulate strings, trees, graphs
• They are good for modelling
• They contain precious information about the interactions
• Why lose their power?
• The goal is therefore to learn from (such) structured data, and to learn models adapted to such data

A comparison

Vector space machine learning (AKA statistical pattern recognition)
  Pros: robust, many algorithms and methods; existence of a topology
  Cons: black box effect; difficult to understand

Rich data representations (AKA structural pattern recognition)
  Pros: richer representation; possibility of capturing the interactions; intelligibility
  Cons: often less noise resistant; often more expensive

The challenge

• When the input is a set of strings, why not learn an automaton, a formal grammar?
• I.e. a model designed to represent languages!

The data for grammatical inference


The data: examples of strings

A sentence in English and its translation to Japanese:

• What's that called in Japanese?
• あれは日本語で何といいますか。 (Arewa nihongo de nanto iimasu ka?)

Transducers can be used to translate

This: これは   is: λ   a: λ   cat: 猫です   computer: は、コンピュータ

Inversely, translating これは、ウサギです and これは、ウサギのですか also needs differing outputs
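The word-by-word mapping above can be sketched as a toy transducer; the rule table follows the slide (λ = empty output), while the function name and the rejection behaviour are illustrative choices:

```python
# A minimal sketch of the word-to-word transducer above:
# each English word maps to a Japanese output (λ = empty string).
RULES = {"this": "これは", "is": "", "a": "", "cat": "猫です"}

def transduce(sentence: str) -> str:
    """Translate word by word; unknown words are rejected."""
    out = []
    for word in sentence.lower().split():
        if word not in RULES:
            raise ValueError(f"no rule for {word!r}")
        out.append(RULES[word])
    return "".join(out)

print(transduce("This is a cat"))  # これは猫です
```

A real transducer would make the output depend on the state as well as the input symbol, which is exactly what the "inversely" example needs.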

The data: examples of strings

• Time series pose the problem of the alphabet:
  – An infinite alphabet?
  – Discretizing?
  – An ordered alphabet?

Sinus rhythm with acquired long QT, work found via Flickr, by Popfossa, CC BY 2.0

The data: examples of strings


Codis profile, Chemical Science & Technology Laboratory, National Institute of Standards and Technology, work found via Wikipedia, CC BY-SA 3.0

The data: examples of strings

>A BAC=41M14 LIBRARY=CITB_978_SKB

AAGCTTATTCAATAGTTTATTAAACAGCTTCTTAAATAGGATATAAGGCAGTGCCATGTA

GTGGATAAAAGTAATAATCATTATAATATTAAGAACTAATACATACTGAACACTTTCAAT

GGCACTTTACATGCACGGTCCCTTTAATCCTGAAAAAATGCTATTGCCATCTTTATTTCA

GAGACCAGGGTGCTAAGGCTTGAGAGTGAAGCCACTTTCCCCAAGCTCACACAGCAAAGA

CACGGGGACACCAGGACTCCATCTACTGCAGGTTGTCTGACTGGGAACCCCCATGCACCT

GGCAGGTGACAGAAATAGGAGGCATGTGCTGGGTTTGGAAGAGACACCTGGTGGGAGAGG

GCCCTGTGGAGCCAGATGGGGCTGAAAACAAATGTTGAATGCAAGAAAAGTCGAGTTCCA

GGGGCATTACATGCAGCAGGATATGCTTTTTAGAAAAAGTCCAAAAACACTAAACTTCAA

CAATATGTTCTTTTGGCTTGCATTTGTGTATAACCGTAATTAAAAAGCAAGGGGACAACA

CACAGTAGATTCAGGATAGGGGTCCCCTCTAGAAAGAAGGAGAAGGGGCAGGAGACAGGA

TGGGGAGGAGCACATAAGTAGATGTAAATTGCTGCTAATTTTTCTAGTCCTTGGTTTGAA

TGATAGGTTCATCAAGGGTCCATTACAAAAACATGTGTTAAGTTTTTTAAAAATATAATA

AAGGAGCCAGGTGTAGTTTGTCTTGAACCACAGTTATGAAAAAAATTCCAACTTTGTGCA

TCCAAGGACCAGATTTTTTTTAAAATAAAGGATAAAAGGAATAAGAAATGAACAGCCAAG

TATTCACTATCAAATTTGAGGAATAATAGCCTGGCCAACATGGTGAAACTCCATCTCTAC

TAAAAATACAAAAATTAGCCAGGTGTGGTGGCTCATGCCTGTAGTCCCAGCTACTTGCGA

GGCTGAGGCAGGCTGAGAATCTCTTGAACCCAGGAAGTAGAGGTTGCAGTAGGCCAAGAT

GGCGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTATGTCCAAAAAAAAAAAAA

AAAAAAAGGAAAAGAAAAAGAAAGAAAACAGTGTATATATAGTATATAGCTGAAGCTCCC

TGTGTACCCATCCCCAATTCCATTTCCCTTTTTTGTCCCAGAGAACACCCCATTCCTGAC

TAGTGTTTTATGTTCCTTTGCTTCTCTTTTTAAAAACTTCAATGCACACATATGCATCCA

TGAACAACAGATAGTGGTTTTTGCATGACCTGAAACATTAATGAAATTGTATGATTCTAT


The data: examples of strings

https://upload.wikimedia.org/wikipedia/commons/3/36/Emperor_family_tree_0_en.png, CC BY-SA 3.0

The data: examples of strings

<book>
  <part>
    <chapter>
      <sect1/>
      <sect1>
        <orderedlist numeration="arabic">
          <listitem/>
          <f:fragbody/>
        </orderedlist>
      </sect1>
    </chapter>
  </part>
</book>

The data: examples of strings

<?xml version="1.0"?>
<?xml-stylesheet href="carmen.xsl" type="text/xsl"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE pagina [
<!ELEMENT pagina (titulus?, poema)>
<!ELEMENT titulus (#PCDATA)>
<!ELEMENT auctor (praenomen, cognomen, nomen)>
<!ELEMENT praenomen (#PCDATA)>
<!ELEMENT nomen (#PCDATA)>
<!ELEMENT cognomen (#PCDATA)>
<!ELEMENT poema (versus+)>
<!ELEMENT versus (#PCDATA)>
]>
<pagina>
  <titulus>Catullus II</titulus>
  <auctor>
    <praenomen>Gaius</praenomen>
    <nomen>Valerius</nomen>
    <cognomen>Catullus</cognomen>
  </auctor>


A linguistic tree. (Courtesy of Mark Knauf and Etsuyo Yuasa, Department of East Asian Languages and Literatures (DEALL), Ohio State University.)

And also

• Business processes
• Bird songs
• Images (contours and shapes)
• Robot moves
• Observations of protocols, server exchanges
• Interactions between systems
• …

The models in grammatical inference


An HMM

• https://en.wikipedia.org/wiki/Hidden_Markov_model

Another HMM (proteins)

• http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html
• And a more interesting example:
• http://www.cbs.dtu.dk/~kj/hmm-real-life-example.pdf

A finite state machine

• https://msdn.microsoft.com/en-us/library/aa478972.aspx

Another FSM (a transducer)

• The "3-state busy beaver" Turing machine in a finite-state representation. Each circle represents a "state" of the TABLE, an "m-configuration" or "instruction". The "direction" of a state transition is shown by an arrow. The label (e.g. 0/P,R) near the outgoing state (at the "tail" of the arrow) specifies the scanned symbol that causes a particular transition (e.g. 0), followed by a slash /, followed by the subsequent "behaviors" of the machine, e.g. "P print" then move tape "R right". No generally accepted format exists. The convention shown follows McClusky (1965), Booth (1965), Hill and Peterson (1974).
• https://commons.wikimedia.org/

A transducer

• Comparing nondeterministic and quasideterministic finite-state transducers built from morphological dictionaries
• Authors: Alicia Garrido-Alenda and Mikel L. Forcada
• https://commons.wikimedia.org/

Stress patterns transducer

• Example: penult; alt secondary

1,w0,s2
2,w0w0,s2s0
3,w0w0w0,s0s2s0
4,w0w0w0w0w0,s0s1s0s2s0
5,w0w0w0w0w0w0w0,s0s1s0s1s0s2s0
6,w0w0w0w0w0w0w0w0w0,s0s1s0s1s0s1s0s2s0

Adapted from http://st2.ullet.net/

A PCFG (so not only finite state machines)


Summarising

• Finite state models
  – DFA
  – NFA
  – PFA
  – HMM
  – Transducer
• Grammatical models
  – Context-free grammar
  – Probabilistic context-free grammar
• (many others)

Partial Conclusion

• If we have some strings and want to learn the models we have just seen… what do we need?

We need… to solve many problems related to the models themselves

PFA: Probabilistic Finite (state) Automaton

(figure: an example PFA, a four-state automaton with transitions labelled a and b)

A PFA

(figure: a PFA with transitions labelled by symbols and probabilities, e.g. a 0.7, b 0.4, a 0.1, a 0.35, and final probabilities 0.2 and 1)

Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.35 × 0.2 = 0.028 + 0.0196 = 0.0476
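The probability of a string under a PFA is a sum over all accepting paths, computed by the forward algorithm. The automaton encoded below is a hypothetical reconstruction, chosen only to be consistent with the Pr(aba) computation on the slide:

```python
from collections import defaultdict

# Hypothetical PFA consistent with the slide's computation of Pr(aba):
# transitions[state][symbol] = list of (next_state, probability).
transitions = {
    0: {"a": [(1, 0.7)]},
    1: {"b": [(2, 0.4)]},
    2: {"a": [(3, 0.1), (4, 0.35)]},
    3: {}, 4: {},
}
final = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 0.2}
initial = {0: 1.0}

def pfa_prob(word):
    """Forward algorithm: sum probabilities over all paths for word,
    then weigh each reached state by its final probability."""
    alpha = dict(initial)
    for sym in word:
        nxt = defaultdict(float)
        for q, p in alpha.items():
            for q2, t in transitions.get(q, {}).get(sym, []):
                nxt[q2] += p * t
        alpha = nxt
    return sum(p * final[q] for q, p in alpha.items())

print(pfa_prob("aba"))  # ≈ 0.0476
```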

Parsing with a PFA

(figure: a PFA with transitions a 0.3, a 0.7, a 0.9, a 0.3, b 0.1 and a final probability 1)

PrA(b) = 0.1
PrA(aaaaa) = 0.3 × 0.9 × 0.3² × 0.7² ≈ 0.0119

Most probable string is?

Most probable string: problems

Name: Most probable string (MPS)
• Instance: a probabilistic automaton A, a probability p > 0
• Question: is there in Σ* a string x such that PrA(x) > p?

Name: Consensus string (CS)
• Instance: a probabilistic automaton A
• Question: find in Σ* a string x such that ∀y ∈ Σ*, PrA(y) ≤ PrA(x)

Results (cdlh & Oncina 2013)

• Key lemma: if w has probability p, then it has length at most |A|²/p
• As a corollary, MPS is decidable!
• There exists an algorithm solving CS whose complexity is O(|Σ|·|A|²/p_opt²)
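A brute-force sketch of how the length bound makes CS solvable: take any string of nonzero probability, derive the bound |A|²/p from the lemma, and enumerate all strings up to that length. The one-state PFA here is a made-up toy (Pr(aⁿ) = 0.5ⁿ⁺¹, so the consensus string is λ), not the talk's example, and the sketch assumes λ has nonzero probability:

```python
from itertools import product

# Toy one-state PFA: a loop on 'a' with probability 0.5,
# final probability 0.5, so Pr(a^n) = 0.5^(n+1).
TRANS = {("q0", "a"): [("q0", 0.5)]}
FINAL = {"q0": 0.5}
START = "q0"
SIGMA = ["a"]
N_STATES = 1

def prob(word):
    """Forward algorithm for this toy PFA."""
    alpha = {START: 1.0}
    for sym in word:
        nxt = {}
        for q, p in alpha.items():
            for q2, t in TRANS.get((q, sym), []):
                nxt[q2] = nxt.get(q2, 0.0) + p * t
        alpha = nxt
    return sum(p * FINAL.get(q, 0.0) for q, p in alpha.items())

def consensus():
    """Enumerate all strings up to the lemma's length bound."""
    best, best_p = "", prob("")
    bound = int(N_STATES ** 2 / best_p)  # lemma: |w| <= |A|^2 / p
    for n in range(1, bound + 1):
        for tup in product(SIGMA, repeat=n):
            w = "".join(tup)
            if prob(w) > best_p:
                best, best_p = w, prob(w)
    return best, best_p

print(consensus())  # ('', 0.5)
```

The announced O(|Σ|·|A|²/p_opt²) algorithm is of course cleverer than this exhaustive enumeration; the sketch only illustrates why the lemma yields decidability.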

Results (recent)

• Suppose we are trying to find the median string, i.e. the string w minimizing Σ_{x∈Σ*} d_e(w,x)·Pr_D(x)
• Then how do we compute this value?
• Currently, we are at least able to compute Σ_{x∈Σ*} d_e(w,x)·Pr_D(x) for a given w
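For intuition, the quantity Σ_{x∈Σ*} d_e(w,x)·Pr_D(x) can at least be approximated by truncated enumeration; this sketch uses a made-up one-symbol distribution Pr(aⁿ) = 0.5ⁿ⁺¹ (the exact algorithm alluded to above is more involved and does not enumerate):

```python
# Approximate the expected edit distance from w to a string drawn from
# a made-up distribution Pr(a^n) = 0.5^(n+1), by truncated enumeration.
def edit_distance(u, v):
    """Standard Levenshtein distance with a rolling row."""
    d = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        prev, d[0] = d[0], i
        for j, cv in enumerate(v, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (cu != cv))
    return d[-1]

def expected_distance(w, max_len=30):
    """Sum d_e(w, a^n) * Pr(a^n) for n up to a cutoff length."""
    total = 0.0
    for n in range(max_len + 1):
        total += edit_distance(w, "a" * n) * 0.5 ** (n + 1)
    return total

print(round(expected_distance("a"), 4))
```

For this toy distribution the exact value for w = "a" is 1 (a small geometric-series computation), which the truncation recovers to within the discarded tail.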

How do we define learning?


What are we hoping for? [the data]

• We are given some strings
• We are given some labelled strings
• We are not given any strings but can ask questions
• (instead of strings, you can think of graphs or trees)

What are we hoping for? [the result]

• Given some strings, perhaps some labels for these strings, build a FSM
• Possible extra tasks:
  – Be robust
  – Be fast
  – Be able to prove that the result is "good"

Learning models

• We can prove that algorithms "learn":
  – that they can identify correctly something
  – that they converge, decreasing the generalisation error

Just one complete example


The problem:

• An agent must take cooperative decisions in a multi-agent world
• Its decisions will depend:
  – on what it hopes to win or lose
  – on the actions of other agents

Hypothesis:

The opponent follows a rational strategy (given by a DFA/Moore machine)

ME: equations (e) or pictures (p)
YOU: listen (l) or doze (d)

(figure: an example of a rational strategy, a Moore machine over inputs e, p with outputs l, d)

Example:

• Each prisoner can admit (a) or stay silent (s)
  – If both admit: 3 years (prison) each
  – If A admits but not B: A = 0 years, B = 5 years
  – If B admits but not A: B = 0 years, A = 5 years
  – If neither admits: 1 year each

The prisoner's dilemma

Payoffs (A, B):

          B: a        B: s
A: a    (-3, -3)    (0, -5)
A: s    (-5, 0)     (-1, -1)

• In our version we study an iterated version against an opponent who follows a rational strategy
• Gain function: limit of means (average over a very long series of moves)
• For example, if we get into a recurrent situation where we both admit, the gain will be -3
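The gain function can be sketched as follows; the payoff table matches the slide, while the helper names are illustrative and the finite average only approximates the limit of means:

```python
# The prisoner's dilemma payoffs and the limit-of-means gain,
# approximated over a (long) finite series of joint moves.
PAYOFF = {          # (my move, opponent's move) -> my gain
    ("a", "a"): -3,
    ("a", "s"): 0,
    ("s", "a"): -5,
    ("s", "s"): -1,
}

def mean_gain(my_moves, their_moves):
    """Average gain over a finite series, approximating the limit of means."""
    gains = [PAYOFF[m, t] for m, t in zip(my_moves, their_moves)]
    return sum(gains) / len(gains)

# If we get into a recurrent situation where we both admit, the gain is -3:
print(mean_gain("a" * 1000, "a" * 1000))  # -3.0
```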

The general problem

• We suppose that the strategy of the opponent is given by a deterministic finite automaton (DFA)
• Can we imagine an optimal strategy?

Running example

(figure: the opponent's strategy, a Moore machine over moves a and s)

Running example

Suppose we know the opponent's strategy. Then (game theory):
• Consider the opponent's graph, in which we value the edges by our own gain, and find the best (infinite) path in the graph

Running example

• Find the cycle of maximum mean weight
• Find the best path leading to this cycle of maximum mean weight
• Follow the path and stay in the cycle

Running example

(figure: the same steps applied to the opponent's graph, with edges valued by our gains -3, 0, -5, -1; the best path reaches a cycle of mean weight -0.5)
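The first step above can be sketched by brute force on a tiny made-up gain graph (chosen here so that the best cycle also has mean -0.5); a real implementation would use, e.g., Karp's mean-cycle algorithm rather than enumerate cycles:

```python
from itertools import permutations

# Brute-force "find the cycle of maximum mean weight" on a made-up
# gain graph: adjacency maps state -> {successor: my gain}.
G = {0: {1: 0}, 1: {0: -1, 2: -5}, 2: {2: -3}}

def max_mean_cycle(g):
    """Try every ordered subset of nodes as a candidate cycle."""
    best = None  # (mean weight, cycle)
    nodes = list(g)
    for r in range(1, len(nodes) + 1):
        for cyc in permutations(nodes, r):
            weight, ok = 0, True
            for i, u in enumerate(cyc):
                v = cyc[(i + 1) % r]
                if v not in g[u]:
                    ok = False
                    break
                weight += g[u][v]
            if ok:
                mean = weight / r
                if best is None or mean > best[0]:
                    best = (mean, cyc)
    return best

print(max_mean_cycle(G))  # the 0 <-> 1 cycle, mean -0.5
```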

Question

Can we play a game against this opponent and… can we then reconstruct his strategy?

The data (him, me)

HIM: a a a s s a a a a s s s s s s a s a

ME        HIM
λ         a
a         a
as        s
asa       a
asaa      a
asaas     s
asaass    s

If I play asa, his move is a

The logic of the algorithm

• The goal is to be able to parse and to have a partial solution consistent with the data
• The algorithm is loosely inspired by a number of grammatical inference algorithms
• It is greedy
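A sketch of the starting point of such a greedy algorithm (not the talk's exact procedure): build a prefix tree from the observed pairs; the successive decisions then amount to merging states of this tree while staying consistent with the data. The node representation is an illustrative choice:

```python
# Build a prefix tree from observed pairs (my string -> his next move);
# greedy state-merging then folds this tree into a small Moore machine.
DATA = {"": "a", "a": "a", "as": "s", "asa": "a",
        "asaa": "a", "asaas": "s", "asaass": "s"}

def build_prefix_tree(data):
    """Each node is {'out': observed output or None, 'kids': {symbol: node}}."""
    root = {"out": None, "kids": {}}
    for word, out in data.items():
        node = root
        for sym in word:
            node = node["kids"].setdefault(sym, {"out": None, "kids": {}})
        node["out"] = out
    return root

tree = build_prefix_tree(DATA)
# After I play "as", the observed response is s:
print(tree["kids"]["a"]["kids"]["s"]["out"])  # s
```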

The algorithm

The first decision

(figure: sure so far: after λ he plays a; have to deal with: his move after a)

The algorithm

The candidates

(figure: the two candidate machines)

Occam's razor: Entia non sunt multiplicanda praeter necessitatem ("Entities should not be multiplied unnecessarily")

The algorithm

The second decision

(figure: sure so far: a → a; have to deal with: as → ?)

The algorithm

The third decision

(figure: one candidate is inconsistent with the data, the other is consistent with a → a, as → s; have to deal with: asa → ?)

The algorithm

The three candidates

(figure: the three candidate machines after the third decision)

The algorithm

The fourth decision

(figure: consistent with a → a, as → s, asa → a, asaa → a, asaas → s; have to deal with: asaass → ?)

The algorithm

The fifth decision

(figure: this candidate is inconsistent with the observed data)

The algorithm

The fifth decision

(figure: this candidate is consistent up to asaass → s; have to deal with: asaasss → ?)

The algorithm

The sixth decision

(figure: this candidate is inconsistent with the observed data)

The algorithm

The sixth decision

(figure: this candidate is consistent up to asaasss → s; have to deal with: asaasssa → ?)

The algorithm

The seventh decision

(figure: this candidate is inconsistent with the observed data)

The algorithm

The seventh decision

(figure: this candidate is consistent with all the observed data)

The algorithm

The result

(figure: the learned Moore machine, consistent with all the observed data)

How do we get hold of the learning data?

a) through observation (like here)

b) through exploration


An open problem

The strategy is probabilistic:

(figure: a Moore machine whose outputs are distributions over moves, e.g. a: 20% / s: 80%, a: 50% / s: 50%, a: 70% / s: 30%)

Tit for tat

(figure: the tit-for-tat strategy as a Moore machine over moves a and s)
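Tit for tat itself is a tiny Moore machine; here is a sketch (state names are illustrative): start silent, then replay the opponent's previous move:

```python
# Tit for tat as a two-state Moore machine: the output of the current
# state is our move; the opponent's move selects the next state.
TRANSITIONS = {("S", "a"): "A", ("S", "s"): "S",
               ("A", "a"): "A", ("A", "s"): "S"}
OUTPUT = {"S": "s", "A": "a"}

def play(opponent_moves, state="S"):
    """Return our moves against a sequence of opponent moves."""
    ours = []
    for move in opponent_moves:
        ours.append(OUTPUT[state])   # emit the current state's output
        state = TRANSITIONS[state, move]
    return "".join(ours)

print(play("aassa"))  # saass
```

Being a two-state DFA/Moore machine, this strategy is exactly the kind of opponent the learning algorithm above can reconstruct from play.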

Summarising and concluding


Time to say more about grammatical inference

• Machine learning where the data is strings and the models are finite state machines
• Many applications (and new ones!)
• Many open questions (in fact, applications direct the questions)
• Researchers in many countries, including Japan
  – Etsuji Tomita, Thomas Zeugmann, Yasubumi Sakakibara, Ryo Yoshinaka, Makoto Kanazawa, Takashi Yokomori
  – And many others!

Acknowledgements

• This presentation includes ideas that have appeared after working with or reading the works of many people.
• Any list is necessarily arbitrary and insufficient.
• But at least, thanks to:
  – Peter Flach (Machine Learning, Cambridge University Press)
  – D. Carmel and S. Markovitch. Model-based learning of interaction strategies in multi-agent systems. Journal of Experimental and Theoretical Artificial Intelligence, 10(3):309–332, 1998
  – D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-agent Systems, 2(2):141–172, 1999