
Introduction to Machine Learning

Laurent Orseau, AgroParisTech

laurent.orseau@agroparistech.fr

EFREI 2010-2011. Based on slides by Antoine Cornuejols.

2

Overview

• Introduction to Induction (Laurent Orseau)
  • Neural Networks
  • Support Vector Machines
  • Decision Trees
• Introduction to Data Mining (Christine Martin)
  • Association Rules
  • Clustering
  • Genetic Algorithms

3

Overview: Introduction

• Introduction to Induction
  Examples of applications
  Learning types
  • Supervised Learning
  • Reinforcement Learning
  • Unsupervised Learning

Machine Learning Theory

• What questions to ask?

Introduction

5

What is Machine Learning?

• Memory
  Knowledge acquisition (neurosciences)
  • Short-term (working): keeps 7±2 objects at a time
  • Long-term
    Procedural
      » Action sequences
    Declarative
      » Semantic (concepts)
      » Episodic (facts)

• Learning types
  By heart
  From rules
  By imitation / demonstration
  By trial & error

• Knowledge reuse
  In similar situations

Introduction

6

What is Machine Learning?

• "The field of study that gives computers the ability to learn without being explicitly programmed "

Arthur Samuel, 1959

Samuel's checkers player
→ Schaeffer 2007 (checkers solved)
+ TD-Gammon, Tesauro 1992

Introduction

7

What is Machine Learning?

Given: experience E, a class of tasks T, and a performance measure P,

A computer is said to learn if

its performance on a task of T

measured by P

increases with experience E

Tom Mitchell, 1997

Introduction

8

Terms related to Machine Learning

• Robotics / automation: Google Cars, Nao

• Prediction / forecasting: stock exchange, pollution peaks, …

• Recognition: faces, language, writing, movements, …

• Optimization: subway speed, traveling salesman, …

• Regulation: heating, traffic, fridge temperature, …

• Autonomy: robots, hand prostheses

• Automatic problem solving

• Adaptation: user preferences, a robot in a changing environment

• Induction
• Generalization
• Automatic discovery
• …

Introduction

Some applications

10

Learning to cook

• Learning by imitation / demonstration
• Procedural learning (motor precision)
• Object recognition

Applications

11

DARPA Grand Challenge (2005)

Applications

12

200km of desert

Natural and artificial dangers

No driver

No remote control


Applications > DARPA Grand Challenge

13

5 Finalists

Applications > DARPA Grand Challenge

14

Recognition of the road

Applications > DARPA Grand Challenge

15

Learning to label images: face recognition

“Face Recognition: Component-based versus Global Approaches” (B. Heisele, P. Ho, J. Wu and T. Poggio), Computer Vision and Image Understanding, Vol. 91, No. 1/2, 6-21, 2003.

Applications

16

Applications > Image recognition

Feature combinations

17

Hand prosthesis

• Recognition of pronator and supinator signals
  Imperfect sensors
  Noise
  Uncertainty

Applications

18

Autonomous robot rover on Mars

Applications

19

Supervised Learning

Learning by heart? Unexploitable: we must generalize.

How to encode the forms?


Introduction to Machine Learning Theory

21

Introduction to Machine Learning Theory

• Supervised Learning

• Reinforcement Learning

• Unsupervised Learning (CM)

• Genetic Algorithms (CM)

22

Supervised Learning

• Set of examples xi labeled ui

• Find a hypothesis h so that:

h(xi) = ui ?

h(xi): predicted label

• Best hypothesis h* ?

23

Supervised Learning: 1st Example

• Houses: Price / m²

• Searching for h
  Nearest neighbors?
  Linear or polynomial regression?

• More information?
  Localization ((x, y) coordinates, or a symbolic variable?), age of the building, neighborhood, swimming pool, local taxes, temporal evolution, …

Supervised Learning

24

Problem

Predicting the price per m² of a given house.

1) Modeling

2) Data gathering

3) Learning

4) Validation

5) Use in real case

Supervised Learning

[Figure: the ideal pipeline vs. actual practice]

25

1) Modeling

• Input space: what is the meaningful information? Which variables?

• Output space: what is to be predicted?

• Hypothesis space: input → (computation) → output
  What kind of computation?

Supervised Learning

26

1-a) Input space: Variables

• What is the meaningful information?
• Should we get as much of it as possible?
• Information quality?
  Noise, quantity

• Cost of information gathering?
  Economic, time, risk (invasive?), ethics, law (CNIL)

• Definition domain of each variable?
  Symbolic, bounded numeric, unbounded, etc.

Supervised Learning > 1) Modeling

27

Price per m²: Variables

• Localization
  Continuous: (x, y) longitude/latitude?
  Symbolic: city name?

• Age of the building
  Year of construction? Relative to the present or to the construction date?

• Nature of soil

• Swimming-pool?

Supervised Learning > 1) Modeling > a) Variables

28

1-b) Output space

• What do we want as output?
  Symbolic classes? (classification)
  • Boolean yes/no (concept learning)
  • Multi-valued A/B/C/D/…
  Numeric? (regression)
  • [0 ; 1]?
  • [-∞ ; +∞]?

• How many outputs? Multi-valued, multi-class?
  • One output per class: learn a model for each output?
  • More "free": learn one model for all outputs?
    • Each model can use the others' information

Supervised Learning > 1) Modeling

29

1-c) Hypothesis space

• Critical!

• Depends on the learning algorithm
  Linear regression: space of h(x) = ax + b
  • Parameters: a and b
  Polynomial regression
  • # of parameters grows with the polynomial degree
• Neural networks, SVMs, genetic algorithms, …
  (a minimal sketch follows below)

Supervised Learning > 1) Modeling
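To make the notion of a parameterized hypothesis space concrete, here is a minimal sketch in Python (NumPy only; the house-price data are invented for illustration): the space of affine functions h(x) = ax + b has exactly two parameters, and a least-squares fit selects one hypothesis from that space, while a polynomial space simply has more parameters.

```python
import numpy as np

# Toy data (invented for illustration): price per m² vs. distance to the city centre.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])                 # km
y = np.array([9000.0, 7500.0, 6800.0, 5200.0, 3900.0])  # EUR / m²

# Hypothesis space H = { h(x) = a*x + b }: two parameters, a and b.
a, b = np.polyfit(x, y, deg=1)        # least-squares choice of one h in H
h = lambda x_new: a * x_new + b

print(f"selected hypothesis: h(x) = {a:.1f}*x + {b:.1f}")
print("prediction for x = 4 km:", h(4.0))

# A richer polynomial space: the number of parameters grows with the degree.
coeffs_deg3 = np.polyfit(x, y, deg=3)   # 4 parameters
```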

30

Choice of hypothesis space

[Figure: total error = approximation error + estimation error]

31

Choice of hypothesis space

• Space too "poor" Inadequate solutions Ex: model sin(x) with y=ax+b

• Space too "rich" risk of overfittingoverfitting• Defined by set of parametersparameters

High # params learning more difficult

• But prefer a richer hypothesis space! Use of generic methods Add regularization

Supervised Learning > 1) Modeling > c) Hypothesis space
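A minimal sketch of the sin(x) example above (NumPy only; the noisy data are generated for illustration): a space that is too poor (degree 1) cannot represent the target, while a space that is too rich (degree 15) fits the noise, which shows up as a much larger error on fresh test points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)  # noisy sin(x)
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test)

def train_and_test_error(degree):
    # NumPy may warn that the degree-15 fit is ill-conditioned; that is part of the point.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for degree in (1, 3, 15):
    tr, te = train_and_test_error(degree)
    print(f"degree {degree:2d}: train error {tr:.3f}, test error {te:.3f}")
# degree 1: both errors are large (space too poor, cannot represent sin).
# degree 15: tiny training error but large test error (overfitting).
```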

32

2) Data gathering

• Gathering: electronic sensors, simulation, polls, automated collection on the Internet, …

• Get as much data as possible vs. the cost of collecting it

• Data as "pure" as possible: avoid noise
  • Noise in the variables
  • Noise in the labels!

• 1 example = 1 value for each variable
  • Missing value = useless example?

Supervised Learning

33

Gathered data

            x1      x2      x3        u
Example 1   Yes     1.5     Green     -
Example 2   No      1.4     Orange    +
Example 3   Yes     3.7     Orange    -
…           …       …       …         …

x1, x2, x3: inputs / measured variables
u: output / class / label
But the true label y is unreachable!

Supervised Learning > 2) Data gathering

34

Data preprocessing

• Clean up the data (ex: reduce background noise)

• Transform the data into the final format adapted to the task
  Ex: Fourier transform of a radio signal: time/amplitude → frequency/amplitude

Supervised Learning > 2) Data gathering

35

3) Learning

a) Choice of program parameters

b) Choice of inductive test

c) Running the learning program

d) Performance test

If bad, return to a)…

Supervised Learning

36

a) Choice of program parameters

• Max allocated computation time

• Max accepted error

• Learning parameters (specific to the model)

• Knowledge introduction: initialize parameters to "ok" values?

• …

Supervised Learning > 3) Learning

37

b) Choice of inductive test

Goal: find the hypothesis h ∈ H minimizing the real risk (expected risk, generalization error):

R(h) = ∫_{X×Y} ℓ(h(x), y) dP(x, y)

where ℓ is the loss function comparing the predicted label h(x) to the true label y (or the desired label u), and P is the joint probability distribution over X × Y.

Supervised Learning > 3) Learning

38

Real risk

• Goal: Minimize real risk

• Real risk is not known, in particular P(X,Y).

Supervised Learning > 3) Learning > b) Inductive test

• Discrimination (classification):
  ℓ(h(xi), ui) = 0 if h(xi) = ui, 1 if h(xi) ≠ ui

• Regression:
  ℓ(h(xi), ui) = (h(xi) - ui)²

R(h) = ∫_{X×Y} ℓ(h(x), y) dP(x, y)

39

Empirical Risk Minimization

• ERM principle: find h ∈ H minimizing the empirical risk
  • i.e. the least error on the training set (see the sketch below)

R_emp(h) = (1/m) Σ_{i=1..m} ℓ(h(xi), ui)

Supervised Learning > 3) Learning > b) Inductive test
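A minimal sketch of the empirical risk R_emp(h) = (1/m) Σ ℓ(h(xi), ui) with the two losses defined earlier (the toy data and the candidate hypotheses are invented for the example):

```python
import numpy as np

def empirical_risk(h, xs, us, loss):
    """Average loss of hypothesis h over the m training examples."""
    return np.mean([loss(h(x), u) for x, u in zip(xs, us)])

def zero_one_loss(pred, u):        # discrimination / classification
    return 0.0 if pred == u else 1.0

def squared_loss(pred, u):         # regression
    return (pred - u) ** 2

# Toy regression problem: H = { h(x) = a*x }, pick the a with least empirical risk.
xs = np.array([0.0, 1.0, 2.0, 3.0])
us = np.array([0.1, 1.9, 4.2, 5.8])
candidates = {a: (lambda x, a=a: a * x) for a in (1.0, 1.5, 2.0, 2.5)}

risks = {a: empirical_risk(h, xs, us, squared_loss) for a, h in candidates.items()}
best_a = min(risks, key=risks.get)
print("empirical risks:", {a: round(r, 3) for a, r in risks.items()})
print("ERM choice: a =", best_a)
```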

40

Learning curve

• Data quantity is important!

[Figure: learning curve, "error" as a function of training set size]

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk

41

Test / Validation

• Measures overfitting / generalization
  Can the acquired knowledge be reused in new circumstances?
  Do NOT validate on the training set!

• Validation on an additional test set

• Cross-validation
  Useful when data is scarce (leave-p-out; see the sketch below)

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
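A minimal sketch of cross-validation, here leave-one-out (leave-p-out with p = 1), on toy data invented for the example: each point is held out once, the model is fit on the rest, and the average held-out error estimates generalization without ever validating on the training set.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 0.9, 2.1, 2.9, 4.2, 4.8])

def loo_cv_error(degree):
    """Leave-one-out cross-validation error of a polynomial fit of given degree."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                   # hold example i out
        coeffs = np.polyfit(x[keep], y[keep], degree)   # fit on the rest
        pred = np.polyval(coeffs, x[i])
        errors.append((pred - y[i]) ** 2)
    return np.mean(errors)

for degree in (1, 2, 4):
    print(f"degree {degree}: leave-one-out error {loo_cv_error(degree):.3f}")
# The degree with the lowest cross-validation error is the safer choice
# when there is too little data for a separate test set.
```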

42

Overfitting

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk

[Figure: real risk vs. empirical risk curves as data quantity varies; the gap between them corresponds to overfitting]

43

Regularization

• Limit overfitting before measuring it on the test set

• Add a penalization term to the inductive test, e.g.:
  • Penalize large numbers (e.g. many or large parameters)
  • Penalize resource use
  • …
  (see the ridge-regression sketch below)

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
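One standard way of adding such a penalization is ridge regression: minimize the empirical risk plus λ times the squared size of the parameters. A minimal sketch (NumPy closed-form solution; the noisy toy data are generated for the example):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||² + lam * ||w||²  (penalizes large parameter values)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy problem: degree-5 polynomial features on 10 noisy points.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
X = np.vander(x, 6)                      # columns x^5, ..., x, 1

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:g}: largest |parameter| = {np.max(np.abs(w)):.2f}")
# A larger lambda shrinks the parameters, giving a smoother, less overfit hypothesis.
```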

44

Maximum a posteriori

• Bayesian approach

• We suppose there exists a prior probability distribution over the hypothesis space H: p_H(h)

• Maximum A Posteriori (MAP) principle:
  Search for the most probable h after having observed the data S

• Ex: observing sheep colors; h = "A sheep is white"

Supervised Learning > 3) Learning > b) Inductive test

45

Minimum Description Length Principle

• Occam's razor: "prefer the simplest hypotheses"

• Simplicity: the size of h → maximum compression

• Maximum a posteriori with p_H(h) = 2^(-d(h))
  • d(h): length of h in bits

• Compression → generalization (see the sketch below)

Supervised Learning > 3) Learning > b) Inductive test
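A minimal sketch of the MDL idea (all numbers are invented for illustration): with the prior p_H(h) = 2^(-d(h)), maximizing the posterior amounts to minimizing the total description length d(h) + (-log2 p(S | h)), i.e. the bits needed to describe the hypothesis plus the bits needed to encode the data given that hypothesis.

```python
import math

# Hypothetical candidates: (name, d(h) in bits, probability the hypothesis
# assigns to the observed data S).  All numbers are invented for illustration.
candidates = [
    ("very simple h", 10, 1e-30),   # short, but explains the data very poorly
    ("medium h",      40, 1e-4),    # reasonable trade-off
    ("very rich h",  400, 1e-2),    # explains the data well, but very long
]

def total_description_length(d_h, p_data_given_h):
    # -log2 p(S | h): number of bits needed to encode the data using h
    return d_h - math.log2(p_data_given_h)

for name, d_h, p in candidates:
    bits = total_description_length(d_h, p)
    print(f"{name:14s}: d(h) + d(S|h) = {bits:.1f} bits")
# MDL (equivalently, MAP with the 2^-d(h) prior) selects the candidate
# with the fewest total bits, here the medium-size hypothesis.
```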

46

c) Running the learning program

• Search for h

• Use the examples of the training set
  One by one
  All together

• Minimize the inductive test

Supervised Learning > 3) Learning

47

Finding the parameters of the model

• Explore the hypothesis space H
  Which hypothesis is best given the inductive test?
  Fundamentally depends on H

a) Structured exploration

b) Local exploration

c) No exploration

Supervised Learning > 3) Learning > c) Running the program

48

Structured exploration

• Structured by a generality relation (partial order)
  Version space
  ILP (Inductive Logic Programming)
  EBL (Explanation-Based Learning)
  Grammatical inference
  Program enumeration

[Figure: hypotheses hi and hj in H, with gms(hi, hj) and smg(hi, hj)]

Supervised Learning > 3) Learning > c) Running the program > Exploring H

49

Representation of the version space

Structured by:

Upper bound: G-set

Lower bound: S-set

• G-set = Set of all most general hypotheses

consistent with known examples

• S-set = Set of all most specific hypotheses

consistent with known examples

[Figure: the version space in H, bounded above by the G-set and below by the S-set, containing hypotheses hi and hj]

Supervised Learning > 3) Learning > c) Running the program > Exploring H

50

Learning…

… by iterated updates of the version space

Idea:

update S-set

and G-set

after each new example

Candidate elimination algorithm

Example: rectangles (cf. blackboard…)

Supervised Learning > 3-c) > Exploring H > Version space

51

Candidate Elimination algorithm

Initialize S (resp. G):

Set of most specific (resp. general), consistent with 1st example

For each new example (+ or -)

update S

update G

Until convergence

or until S = G = Ø

Supervised Learning > 3-c) > Exploring H > Version space

54

Updating S and G: xi is positive

• Updating S
  Generalize the hypotheses in S not covering xi, just enough to cover it
  Then eliminate the hypotheses in S
  • covering one or more negative examples
  • more general than another hypothesis in S

• Updating G
  Eliminate the hypotheses in G not covering xi

Supervised Learning > 3-c) > Exploring H > Version space

55

Updating S and G: xi is negative

• Updating S
  Eliminate the hypotheses in S (wrongly) covering xi

• Updating G
  Specialize the hypotheses in G covering xi

just enough not to cover it

Then eliminate hypotheses in G

• not more general than at least one element of S

• more specific than at least another hypothesis of G

Supervised Learning > 3-c) > Exploring H > Version space
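A minimal sketch of candidate elimination for a simple conjunctive hypothesis space (attribute vectors with a '?' wildcard, in the spirit of Mitchell's classic formulation). This is an illustrative simplification, not the rectangle example from the blackboard: with conjunctions the S-set reduces to a single most-specific hypothesis, and the elimination steps that compare hypotheses inside S and G are omitted.

```python
# Hypotheses are tuples of attribute values; '?' matches anything.

def covers(h, x):
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def generalize(s, x):
    """Minimal generalization of s so that it covers the positive example x."""
    return tuple(sv if sv == xv else '?' for sv, xv in zip(s, x))

def specializations(g, s, x):
    """Minimal specializations of g that reject the negative example x
    while remaining more general than (or equal to) the S hypothesis s."""
    return [g[:i] + (s[i],) + g[i + 1:]
            for i, gv in enumerate(g)
            if gv == '?' and s[i] != '?' and s[i] != x[i]]

def candidate_elimination(examples):
    positives = [x for x, label in examples if label == '+']
    s = positives[0]                        # most specific, consistent with 1st +
    g_set = [tuple('?' for _ in s)]         # most general hypothesis
    for x, label in examples:
        if label == '+':
            s = generalize(s, x)
            g_set = [g for g in g_set if covers(g, x)]
        else:
            g_set = [h for g in g_set
                     for h in (specializations(g, s, x) if covers(g, x) else [g])]
    return s, g_set

# Toy data: (sky, temperature, humidity) -> enjoy sport?
examples = [
    (('sunny', 'warm', 'normal'), '+'),
    (('sunny', 'warm', 'high'),   '+'),
    (('rainy', 'cold', 'high'),   '-'),
]
s, g = candidate_elimination(examples)
print("S:", s)   # ('sunny', 'warm', '?')
print("G:", g)   # [('sunny', '?', '?'), ('?', 'warm', '?')]
```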

56

Candidate Elimination Algorithm

Updating S and G

[Figure: successive updates of the S-set and G-set in H as positive and negative examples (a)-(d') arrive]

Supervised Learning > 3-c) > Exploring H > Version space

57

Local exploration

• Only a neighborhood notion in H: "gradient" methods
  • Neural networks
  • SVMs
  • Simulated annealing / simulated evolution

• /!\ Local Minima

Supervised Learning > 3) Learning > c) Running the program > Exploring H


58

Exploration without hypothesis space

• No hypothesis space: use the examples directly

• ... and the example space: nearest-neighbor methods
  (case-based reasoning / instance-based learning)
  Notion of distance

• Example: k nearest neighbors
  Optional: vote weighted by distance (see the sketch below)

Supervised Learning > 3) Learning > c) Running the program > Exploring H
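A minimal sketch of k nearest neighbors with the optional distance-weighted vote (plain Python standard library; the 2-D toy points are invented for the example):

```python
import math
from collections import defaultdict

def knn_predict(query, examples, k=3, weighted=True):
    """examples: list of (point, label).  Vote among the k closest points,
    optionally weighted by 1/distance; the distance is the inductive bias."""
    by_distance = sorted((math.dist(query, x), u) for x, u in examples)
    votes = defaultdict(float)
    for d, u in by_distance[:k]:
        votes[u] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

examples = [((0.0, 0.0), '-'), ((0.2, 0.1), '-'), ((1.0, 1.0), '+'),
            ((0.9, 1.2), '+'), ((1.1, 0.8), '+')]
print(knn_predict((0.8, 0.9), examples, k=3))   # '+'
```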

59

Inductive bias

• A priori preference for some hypotheses
  Depends on H
  Depends on the search algorithm

• Whatever the inductive test:
  ERM: implicit in H
  MAP: explicit, chosen by the user
  MDL: explicit (length in bits)
  k-NN: the notion of distance

• What justification?

Supervised Learning

Supervised Learning

Less frequent learning types

61

Incremental Learning

• Examples are given/taken one after the other
  Incremental update of the best hypothesis
  Use the acquired knowledge to
  • learn better
  • learn faster

• Data is no longer i.i.d.!
  i.i.d.: independently and identically distributed
  = sampled independently from a non-changing example generator
  Dependence on time / on the sequence

• Ex: mobile phone users' preferences, learning to program, …

Supervised Learning

62

Active Learning

• Set of unlabeled examples
  Labeling an example is expensive
  Choose an example to be labeled: how to choose?

• Data is not i.i.d.

• Ex: video sequence labeling

Supervised Learning

Other types of Machine Learning

Reinforcement Learning

Unsupervised Learning

64

Reinforcement Learning

• Pavlov
  Bell: trigger
  Dog bowl: reward
  Salivating: action
  Association bell ↔ bowl
  Reinforcement of "salivation"

[Figure: agent-environment loop: perception, action, reward/punishment]

• Control behavior with rewards / punishments

65

Reinforcement Learning

• The agent must discover the right behavior
  And optimize it: maximize the expected reward

s_t: state at time t

Action selection: a_t := argmax_a Q(s_t, a)

• Updating the values (r_t: reward received at time t):
  Q(s_t, a_t) ← α Q(s_t, a_t) + (1 - α) [ r_{t+1} + γ max_a Q(s_{t+1}, a) ]
  (see the sketch below)
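A minimal sketch of the value update above: tabular Q-learning on a tiny invented chain environment (states 0..4, reward 1 on reaching state 4), with a little ε-greedy exploration added so the argmax action selection does not get stuck. The environment and all constants are illustrative assumptions.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.5, 0.9, 0.1
actions = ['left', 'right']
Q = defaultdict(float)                       # Q[(state, action)], defaults to 0

def step(state, action):
    """Invented chain world: move left/right on states 0..4, reward 1 at state 4."""
    nxt = min(state + 1, 4) if action == 'right' else max(state - 1, 0)
    return nxt, (1.0 if nxt == 4 else 0.0)

for episode in range(200):
    s = 0
    while s != 4:
        # epsilon-greedy version of a_t := argmax_a Q(s_t, a)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, act)] for act in actions)
        # Value update: Q(s,a) <- alpha*Q(s,a) + (1-alpha)*(r + gamma*max_a' Q(s',a'))
        Q[(s, a)] = alpha * Q[(s, a)] + (1 - alpha) * (r + gamma * best_next)
        s = s_next

greedy_policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(4)}
print(greedy_policy)    # after learning, 'right' should dominate
```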

66

Unsupervised Learning

• No class, no output, no reward
• Goal: group similar examples together

• Notion of distance
• Inductive bias
  (see the sketch below)
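A minimal sketch of the grouping idea, using a bare-bones k-means (one common distance-based clustering method; the 1-D toy data are invented for the example). No labels are used anywhere: only the notion of distance decides the groups.

```python
import numpy as np

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Group examples around k centers using only distances (no labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (the distance is the bias)
        assign = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        centers = np.array([points[assign == j].mean() for j in range(k)])
    return assign, centers

points = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])
assign, centers = kmeans_1d(points, k=2)
print("cluster of each point:", assign)
print("cluster centers:", centers)   # two groups, around ~1 and ~5
```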

67

Conclusion

• Induction: find a general hypothesis from examples

• Avoid overfitting

• Choose the right hypothesis space
  Not too small (bad induction)
  Not too large (overfitting)

• Use an algorithm adequate for the data and the hypothesis space

68

What to remember

• Supervised learning is the most studied setting

• Learning is always biased

• Learning depends on the structure of the hypothesis space
  No structure: interpolation methods

Local structure: gradient methods (approximation)

Partial order relation: guided exploration (exploration)
