
1/23

Learning from positive examples

Main ideas and the particular case of CProgol4.2

Daniel Fredouille, CIG talk, 11/2005

2/23

What is it all about?

• Symbolic machine learning.
• Learning from positive examples instead of positive and negative examples.
• The talk contains two parts:

1. General ideas and tactics to learn from positives.

2. How the particular ILP system CProgol4.2 of S. Muggleton (1997) deals with positive-only learning.

3/23

Disclaimer

• This talk is not extracted from a survey or any particular article: it is more a patchwork of my experiences in the domain and how I interpret them.

• Feel free to criticize: I would like feedback on these ideas since I have never shared them before.

• I would really appreciate comments on the slides marked with the ? sign.

4/23

Definitions

[Figure: the concept space and the instance space. The target concept C and the inferred concept C' live in the concept space, which is ordered by generality; positive and negative examples of C live in the instance space.]

• "Is more general than" / "is less specific than": the concept space is usually partially ordered by this relation.
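As a minimal illustration (not from the slides; the finite instance space and the concepts below are invented), one can identify a concept with the set of instances it covers, so that "more general than" is simply coverage inclusion, and the ordering is only partial:

```python
# Toy sketch (invented example): over a finite instance space, identify each
# concept with the set of instances it covers. "More general than" is then
# coverage inclusion, which gives a partial (not total) order.

instances = range(10)

def covered(predicate):
    return {x for x in instances if predicate(x)}

even       = covered(lambda x: x % 2 == 0)              # {0, 2, 4, 6, 8}
small_even = covered(lambda x: x % 2 == 0 and x < 5)    # {0, 2, 4}
multiple_3 = covered(lambda x: x % 3 == 0)              # {0, 3, 6, 9}

def more_general(c1, c2):
    """c1 is more general than (or equal to) c2: it covers everything c2 covers."""
    return c1 >= c2

print(more_general(even, small_even))   # True: 'even' generalises 'small even'
print(more_general(even, multiple_3))   # False
print(more_general(multiple_3, even))   # False: incomparable, hence a partial order
```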

5/23

Positive and Negative Learning

Possibility 1: Discrimination of classes
• Characterise the difference between the positive and negative examples.
• No model of the positive concept!

?

6/23

Positive and Negative Learning

Possibility 2: Characterisation of a class
• Use negative examples to prevent over-generalisation.
• Needs negative examples "close" to the concept border.

?

7/23

Positive Only Learning

Aim: Characterisation of a class

Choice ?

8/23

Positive Only Learning

• Two strategies:
1. Bias in the search space: choosing a space with a (very) strong structure.
2. Bias in the evaluation function: choose a concept making a compromise between:
– Generality/specificity of the concept
– Coverage of the positives by the concept
– Complexity of the hypothesis representing the concept

?

9/23

Search space bias approach

• Main idea: consider strongly organised concept spaces.

• Possible inference algorithm:
– Select the least general concept covering all examples.
– The constraints on the search space ensure there is only one such concept.

Trivial example (generally not useful): a "tree organisation" of the concept space (a small sketch follows).
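A minimal Python sketch of this idea, using an invented toy taxonomy: when the concept space is organised as a tree, the unique least general concept covering a set of positive examples is their lowest common ancestor.

```python
# Hypothetical sketch: positive-only inference in a tree-ordered concept space.
# Each concept has a single parent (a more general concept). The least general
# concept covering all positives is the lowest common ancestor of the examples'
# concepts, and the tree structure guarantees it is unique.

parents = {                      # toy taxonomy, invented for illustration
    "animal": None,
    "bird": "animal", "mammal": "animal",
    "penguin": "bird", "eagle": "bird",
    "dog": "mammal", "cat": "mammal",
}

def ancestors(concept):
    """Chain of concepts from `concept` up to the root (most general)."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parents[concept]
    return chain

def least_general_cover(examples):
    """Unique least general concept covering every positive example."""
    common = set(ancestors(examples[0]))
    for e in examples[1:]:
        common &= set(ancestors(e))
    # among the shared ancestors, the most specific one is the deepest
    return min(common, key=lambda c: -len(ancestors(c)))

print(least_general_cover(["penguin", "eagle"]))   # -> 'bird'
print(least_general_cover(["penguin", "cat"]))     # -> 'animal'
```

Negative examples are never needed here: the structure of the space alone prevents over-generalisation beyond the lowest common ancestor.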

10/23

Search space bias approach

• Advantages:
– Strong theoretical convergence results possible.
– Can lead to (very) fast inference algorithms.

• Drawbacks:
– Not available for all concept spaces!
– Theorem: super-finite classes of concepts are not inferable in the limit this way (Gold 67).
Super-finite = contains all concepts covering a finite number of examples, plus at least one concept covering an infinity of them.

11/23

Heuristic Approach

• Scoring makes a compromise between:
1. Specificity of the concept
2. Coverage of the positives by the concept
3. Complexity of the concept

• Implementations:
– Ad-hoc measures of points 1, 2, 3, combined in a formula, e.g.:
Score = Coverage + Specificity – Complexity
– Minimum Message Length ideas (~MDL)

?

12/23

Heuristic Approach: Ad-hoc implementation

• Elements of the score:
– Coverage: counting covered instances
– Specificity: measure of the "proportion" of the instance space covered
– Complexity: the size of the concept representation (e.g., number of rules)

• Advantages:
– Usually easy to implement
– Usually provides parameters to tune the compromise

• Disadvantages:
– No theory
– Bias not always clear
– How to combine coverage/specificity/complexity? (a toy sketch follows)

?
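To make the combination question concrete, here is a small sketch of one such ad-hoc score; every concrete measure, weight and toy concept below is invented for illustration. Coverage counts covered positives, specificity is derived from the fraction of a random instance sample that the hypothesis covers, and complexity is the number of rules.

```python
import math
import random

# Hypothetical ad-hoc score combining the three ingredients of the slides:
#   Score = Coverage + Specificity - Complexity
# Every concrete measure and weight below is invented purely for illustration.

def adhoc_score(covers, n_rules, positives, instance_sample,
                w_cov=1.0, w_spec=1.0, w_cplx=1.0):
    """covers(x) -> bool tells whether the hypothesis covers instance x."""
    coverage = sum(covers(e) for e in positives)              # covered positives
    covered_frac = sum(covers(x) for x in instance_sample) / max(1, len(instance_sample))
    specificity = -math.log(covered_frac + 1e-9)              # rare coverage = specific
    complexity = n_rules                                       # size of the representation
    return w_cov * coverage + w_spec * specificity - w_cplx * complexity

# Toy usage: hypothesis "even numbers not larger than 20" over instances 0..99,
# represented by two rules (hence complexity 2).
def covers(x):
    return x % 2 == 0 and x <= 20

positives = [0, 2, 8, 14, 20]
sample = [random.randrange(100) for _ in range(1000)]
print(adhoc_score(covers, n_rules=2, positives=positives, instance_sample=sample))
```

The weights w_cov, w_spec and w_cplx are exactly the tuning parameters mentioned above: convenient in practice, but with no principled way to choose them.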

13/23

Heuristic Approach: MML implementation

[Figure: two channel diagrams. MML for discrimination: the message sent is the hypothesis, followed by the classes of the examples encoded given the hypothesis. MML for characterisation: the message sent is the hypothesis, followed by the examples and their classes encoded given the hypothesis.]

Gain = number of bits needed to send the message without compression – number of bits needed to send the message with compression (a toy sketch follows).

?
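A toy numeric sketch of the characterisation view, with an entirely invented encoding (far simpler than anything a real MML system would use): instances are integers in [0, N), a hypothesis is an interval, and the gain is positive only when the hypothesis genuinely compresses the positives.

```python
import math

# Minimal MML-for-characterisation sketch (all encoding choices invented):
# instances are integers in [0, N); a hypothesis is an interval [lo, hi].
# Without compression every example costs log2(N) bits; with the hypothesis we
# first pay for the interval itself, then log2(hi - lo + 1) bits per example.

N = 1024

def bits_raw(examples):
    return len(examples) * math.log2(N)

def bits_with_hypothesis(examples, lo, hi):
    assert all(lo <= e <= hi for e in examples), "hypothesis must cover the examples"
    cost_hyp = 2 * math.log2(N)                       # send the two interval bounds
    cost_data = len(examples) * math.log2(hi - lo + 1)
    return cost_hyp + cost_data

def mml_gain(examples, lo, hi):
    # positive gain = the hypothesis compresses the positives
    return bits_raw(examples) - bits_with_hypothesis(examples, lo, hi)

examples = [100, 104, 111, 120, 125, 127]
print(mml_gain(examples, 100, 127))   # tight interval: positive gain
print(mml_gain(examples, 0, 1023))    # maximally general interval: negative gain
```

The over-general hypothesis loses because it does not shorten the description of the examples at all, while still costing bits to transmit: this is how MML penalises over-generalisation without any negative examples.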

14/23

Heuristic Approach: MML implementation

• Advantages:
– Some theoretical justifications in the works of Kolmogorov, Solomonoff, Occam, Bayes and Chaitin.
– Absolute and meaningful score.

• Disadvantages:
– Limit of the theory: the optimal code can NOT be computed!
– Difficult implementation: the choice of the encoding creates the inference biases, which is not very intuitive.

15/23

Positive only learning in ILP with CProgol4.2

16/23

Positive only learning in ILP

• The following is not a survey! It comes from what I have already encountered; I have not looked for further references.

• MML implementations:
– Muggleton [88]
– Srinivasan, Muggleton, Bain [93]
– Stahl [96]

• Other implementations:
– Muggleton, CProgol4.2 [97]
– A heuristic ad-hoc method
– Somehow based on MML, but the implementation details make it quite different.

17/23

CProgol4.2 uses Bayes

[Figure: the hypothesis space H, with distribution D_H over hypotheses h, and the instance space I, with distribution D_I over instances i. A hypothesis h covers a region of I, inducing the conditional distribution D_I|h; the examples E lie in that region.]

• Score: P(h | E) = P(h) * P(E | h) / P(E)
• Approach: fix the distributions and compute P(h), P(E | h) and P(E).

18/23

Assumptions for the distributions

• P(h) = e^(-size(h))
– Large theories are less probable than small ones.
– size(h) = sum, over the rules c_i of h, of the number of literals in the body of c_i.

• P(E | h) = Π_{e ∈ E} D_I|h(e) = Π_{e ∈ E} D_I(e) / D_I(h)
– Assumption that D_I and D_H give D_I|h, i.e. D_I|h(e) = D_I(e) / D_I(h) for instances e covered by h.
– Independence assumption between the examples.
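A small numeric sketch of these two assumptions (the explicit instance distribution, the hypotheses and their sizes are invented): the prior is e^(-size(h)), and the likelihood renormalises D_I over the instances covered by h.

```python
import math

# Toy sketch of the two distributional assumptions (all numbers invented):
# a tiny explicit instance distribution D_I, hypotheses given as the sets of
# instances they cover, P(h) = exp(-size(h)), and
# P(E | h) = prod over e in E of D_I(e) / D_I(h),
# where D_I(h) is the total D_I-weight of the instances h covers.

D_I = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}      # instance distribution

def D_I_of(h):
    """Weight of hypothesis h = sum of D_I over the instances it covers."""
    return sum(D_I[x] for x in h)

def log_prior(size_h):
    return -size_h                                   # ln P(h) = -size(h)

def log_likelihood(examples, h):
    assert all(e in h for e in examples), "h must cover every positive example"
    return sum(math.log(D_I[e] / D_I_of(h)) for e in examples)

def log_posterior_up_to_constant(examples, h, size_h):
    return log_prior(size_h) + log_likelihood(examples, h)

E = ["a", "a", "b"]
print(log_posterior_up_to_constant(E, {"a", "b"}, size_h=2))            # specific h
print(log_posterior_up_to_constant(E, {"a", "b", "c", "d"}, size_h=1))  # general h
```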

19/23

Replacing in Bayes

• P(h | E) = e^(-size(h)) * [ Π_{e ∈ E} D_I(e) / D_I(h) ] / P(E)

• As we only want to compare hypotheses:
P(h | E) = [ e^(-size(h)) / D_I(h)^|E| ] * Cste1

• Taking the log:
ln(P(h | E)) = -size(h) + |E| * ln( 1 / D_I(h) ) + Cste2

• We still have to compute D_I(h) ...

20/23

D_I(h): weight of h in the instance space

• Computing D_I:
– Using a stochastic logic program S trained with the BK (background knowledge) to model D_I (not included in the talk).

• Computing D_I(h):
– Generate R instances from D_I.
– h covers r of them.
– D_I(h) = (r+1) / (R+2)
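Putting the last two slides together, a sketch of the resulting score (the sampler, the hypotheses and their sizes are invented; in CProgol the sampling role is played by the stochastic logic program): estimate D_I(h) with the Laplace-corrected count (r+1)/(R+2), then compare hypotheses with ln P(h | E) ≈ -size(h) + |E| * ln(1/D_I(h)).

```python
import math
import random

# Sketch (all concrete choices invented): estimate D_I(h) by sampling from D_I,
# then score a hypothesis with  ln P(h|E) ~ -size(h) + |E| * ln(1 / D_I(h)).

def sample_instance():
    """Stand-in for sampling from D_I (in CProgol this role is played by a
    stochastic logic program built from the background knowledge)."""
    return random.randrange(100)

def estimate_D_I(covers, R=10_000):
    r = sum(covers(sample_instance()) for _ in range(R))
    return (r + 1) / (R + 2)                        # Laplace-corrected estimate

def log_score(covers, size_h, examples):
    d = estimate_D_I(covers)
    return -size_h + len(examples) * math.log(1.0 / d)

E = [0, 2, 8, 14, 20]                               # positive examples
tight = lambda x: x % 2 == 0 and x <= 20            # covers ~10% of instances
loose = lambda x: x % 2 == 0                        # covers ~50% of instances
print("tight:", log_score(tight, size_h=2, examples=E))
print("loose:", log_score(loose, size_h=1, examples=E))
```

The more specific hypothesis that still covers all positives gets the better score: the |E| * ln(1/D_I(h)) term rewards specificity, while -size(h) keeps the theory from growing arbitrarily.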

21/23

Formula for a whole theory covering E

• ln(P(h | E)) = -size(h) - |E| * ln( (r+1)/(R+2) ) + C2
(complexity: size(h); coverage: |E|; specificity: ln((r+1)/(R+2)))

Estimation of the final theory score from a partially inferred theory h' covering p of the |E| positives:
• ln(P(h' | E)) = -(|E|/p) * size(h') - |E| * ln( (|E|/p) * (r'+1)/(R+2) ) + C3

22/23

Final evaluation

• Suppressing |E| and the constant:
– f(h') = size(h')/p + ln(p) - ln( |E| * (r'+1)/(R+2) )

• Possible boost of the positives by a factor k:
– f(h') = size(h')/(k*p) + ln(k*p) - ln( |E| * (r'+1)/(R+2) )

• This formula is not written anywhere (the one above is my best guess!).
• The papers are hard to understand.
• But it seems to work ...

(the terms again play the roles of complexity, specificity and coverage)

23/23

Conclusion

• Learning from positives only is a real challenge, and methods for learning from positives and negatives can hardly be adapted to it.

• Some nice theoretical frameworks exist.

• When it comes to implementing heuristic frameworks:
– The theory is often lost in the approximations and implementation choices.
– Useful systems can be created, but tuning and understanding the biases have to be treated as very important stages of inference.
