
Introduction to Machine Learning

Laurent Orseau, AgroParisTech

laurent.orseau@agroparistech.fr

EFREI 2010-2011. Based on slides by Antoine Cornuejols.

2

Overview

• Introduction to Induction (Laurent Orseau)
  • Neural Networks
  • Support Vector Machines
  • Decision Trees
• Introduction to Data Mining (Christine Martin)
  • Association Rules
  • Clustering
  • Genetic Algorithms

3

Overview: Introduction

• Introduction to Induction
  Examples of applications
  Learning types
  • Supervised Learning
  • Reinforcement Learning
  • Unsupervised Learning

Machine Learning Theory

• What questions to ask?

Introduction

5

What is Machine Learning?

• Memory
  Knowledge acquisition (neurosciences)
  • Short-term (working): keeps 7±2 objects at a time
  • Long-term
    Procedural
      » Action sequences
    Declarative
      » Semantic (concepts)
      » Episodic (facts)

• Learning types
  By heart
  From rules
  By imitation / demonstration
  By trial & error

• Knowledge reuse
  In similar situations

Introduction

6

What is Machine Learning?

• "The field of study that gives computers the ability to learn without being explicitly programmed "

Arthur Samuel, 1959

Samuel's checkers player
→ Schaeffer 2007 (checkers solved)
+ TD-Gammon, Tesauro 1992

Introduction

7

What is Machine Learning?

Given: experience E, a class of tasks T, and a performance measure P,

A computer is said to learn if

its performance on a task of T

measured by P

increases with experience E

Tom Mitchell, 1997

Introduction

8

Terms related to Machine Learning

• Robotics / automation: Google Cars, Nao

• Prediction / forecasting: stock exchange, pollution peaks, …

• Recognition: faces, language, writing, movements, …

• Optimization: subway speed, traveling salesman, …

• Regulation: heating, traffic, fridge temperature, …

• Autonomy: robots, hand prostheses

• Automatic problem solving

• Adaptation: user preferences, a robot in a changing environment

• Induction
• Generalization
• Automatic discovery
• …

Introduction

Some applications

10

Learning to cook

• Learning by imitation / demonstration
• Procedural learning (motor precision)
• Object recognition

Applications

11

DARPA Grand Challenge (2005)

Applications

12

200km of desert

Natural and artificial dangers

No driver

No remote control


Applications > DARPA Grand Challenge

13

5 Finalists

Applications > DARPA Grand Challenge

14

Recognition of the road

Applications > DARPA Grand Challenge

15

Learning to label images: face recognition

“Face Recognition: Component-based versus Global Approaches” (B. Heisele, P. Ho, J. Wu and T. Poggio), Computer Vision and Image Understanding, Vol. 91, No. 1/2, 6-21, 2003.

Applications

16

Applications > Image recognition

Feature combinations

17

Hand prosthesis

• Recognition of pronator and supinator signals
  Imperfect sensors
  Noise
  Uncertainty

Applications

18

Autonomous robot rover on Mars

Applications

19

Supervised Learning

Learning by heart? Unexploitable: we must generalize.

How to encode the forms?


Introduction to Machine Learning Theory

21

Introduction to Machine Learning Theory

• Supervised Learning

• Reinforcement Learning

• Unsupervised Learning (CM)

• Genetic Algorithms (CM)

22

Supervised Learning

• Set of examples xi labeled ui

• Find a hypothesis h so that:

h(xi) = ui ?

h(xi): predicted label

• Best hypothesis h* ?

23

Supervised Learning: 1st Example

• Houses: Price / m²

• Searching for h
  Nearest neighbors?
  Linear or polynomial regression?

• More information?
  Localization ((x, y) coordinates, or a symbolic variable?), age of the building, neighborhood, swimming pool, local taxes, temporal evolution, …

Supervised Learning

24

Problem

Predicting the price per m² of a given house.

1) Modeling

2) Data gathering

3) Learning

4) Validation

5) Use in real case

Supervised Learning

[Figure: the ideal pipeline vs. actual practice]

25

1) Modeling

• Input space: what is the meaningful information? Which variables?

• Output space: what is to be predicted?

• Hypothesis space: input → (computation) → output
  What kind of computation?

Supervised Learning

26

1-a) Input space: Variables

• What is the meaningful information?
• Should we get as much of it as possible?
• Information quality?
  Noise, quantity

• Cost of information gathering?
  Economic, time, risk (invasive?), ethics, law (CNIL)

• Definition domain of each variable?
  Symbolic, bounded numeric, unbounded, etc.

Supervised Learning > 1) Modeling

27

Price per m²: Variables

• Localization
  Continuous: (x, y) longitude/latitude?
  Symbolic: city name?

• Age of the building
  Year of construction? Relative to the present or to the construction date?

• Nature of soil

• Swimming-pool?

Supervised Learning > 1) Modeling > a) Variables

28

1-b) Output space

• What do we want as output?
  Symbolic classes? (classification)
  • Boolean yes/no (concept learning)
  • Multi-valued A/B/C/D/…
  Numeric? (regression)
  • [0 ; 1]?
  • [-∞ ; +∞]?

• How many outputs? Multi-valued, multi-class?
  • One output per class: learn a model for each output?
  • More "free": learn one model for all outputs?
    • Each model can use the others' information

Supervised Learning > 1) Modeling

29

1-c) Hypothesis space

• Critical!

• Depends on the learning algorithm
  Linear regression: space of h(x) = ax + b
  • Parameters: a and b
  Polynomial regression
  • # of parameters grows with the polynomial degree
• Neural networks, SVMs, genetic algorithms, …
  (a minimal sketch follows below)

Supervised Learning > 1) Modeling
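To make the notion of a parameterized hypothesis space concrete, here is a minimal sketch in Python (NumPy only; the house-price data are invented for illustration): the space of affine functions h(x) = ax + b has exactly two parameters, and a least-squares fit selects one hypothesis from that space, while a polynomial space simply has more parameters.

```python
import numpy as np

# Toy data (invented for illustration): price per m² vs. distance to the city centre.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])                 # km
y = np.array([9000.0, 7500.0, 6800.0, 5200.0, 3900.0])  # EUR / m²

# Hypothesis space H = { h(x) = a*x + b }: two parameters, a and b.
a, b = np.polyfit(x, y, deg=1)        # least-squares choice of one h in H
h = lambda x_new: a * x_new + b

print(f"selected hypothesis: h(x) = {a:.1f}*x + {b:.1f}")
print("prediction for x = 4 km:", h(4.0))

# A richer polynomial space: the number of parameters grows with the degree.
coeffs_deg3 = np.polyfit(x, y, deg=3)   # 4 parameters
```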

30

Choice of hypothesis space

[Figure: total error = approximation error + estimation error]

31

Choice of hypothesis space

• Space too "poor" Inadequate solutions Ex: model sin(x) with y=ax+b

• Space too "rich" risk of overfittingoverfitting• Defined by set of parametersparameters

High # params learning more difficult

• But prefer a richer hypothesis space! Use of generic methods Add regularization

Supervised Learning > 1) Modeling > c) Hypothesis space
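A minimal sketch of the sin(x) example above (NumPy only; the noisy data are generated for illustration): a space that is too poor (degree 1) cannot represent the target, while a space that is too rich (degree 15) fits the noise, which shows up as a much larger error on fresh test points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)  # noisy sin(x)
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test)

def train_and_test_error(degree):
    # NumPy may warn that the degree-15 fit is ill-conditioned; that is part of the point.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for degree in (1, 3, 15):
    tr, te = train_and_test_error(degree)
    print(f"degree {degree:2d}: train error {tr:.3f}, test error {te:.3f}")
# degree 1: both errors are large (space too poor, cannot represent sin).
# degree 15: tiny training error but large test error (overfitting).
```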

32

2) Data gathering

• Gathering: electronic sensors, simulation, polls, automated collection on the Internet, …

• Get as much data as possible vs. the cost of collecting it

• Data as "pure" as possible: avoid noise
  • Noise in the variables
  • Noise in the labels!

• 1 example = 1 value for each variable
  • Missing value = useless example?

Supervised Learning

33

Gathered data

            x1      x2      x3        u
Example 1   Yes     1.5     Green     -
Example 2   No      1.4     Orange    +
Example 3   Yes     3.7     Orange    -
…           …       …       …         …

x1, x2, x3: inputs / measured variables
u: output / class / label
But the true label y is unreachable!

Supervised Learning > 2) Data gathering

34

Data preprocessing

• Clean up the data (ex: reduce background noise)

• Transform the data into the final format adapted to the task
  Ex: Fourier transform of a radio signal: time/amplitude → frequency/amplitude

Supervised Learning > 2) Data gathering

35

3) Learning

a) Choice of program parameters

b) Choice of inductive test

c) Running the learning program

d) Performance test

If bad, return to a)…

Supervised Learning

36

a) Choice of program parameters

• Max allocated computation time

• Max accepted error

• Learning parameters (specific to the model)

• Knowledge introduction: initialize parameters to "ok" values?

• …

Supervised Learning > 3) Learning

37

b) Choice of inductive test

Goal: find the hypothesis h ∈ H minimizing the real risk (expected risk, generalization error):

R(h) = ∫_{X×Y} ℓ(h(x), y) dP(x, y)

where ℓ is the loss function comparing the predicted label h(x) to the true label y (or the desired label u), and P is the joint probability distribution over X × Y.

Supervised Learning > 3) Learning

38

Real risk

• Goal: Minimize real risk

• Real risk is not known, in particular P(X,Y).

Supervised Learning > 3) Learning > b) Inductive test

• Discrimination (classification):
  ℓ(h(xi), ui) = 0 if h(xi) = ui, 1 if h(xi) ≠ ui

• Regression:
  ℓ(h(xi), ui) = (h(xi) - ui)²

R(h) = ∫_{X×Y} ℓ(h(x), y) dP(x, y)

39

Empirical Risk Minimization

• ERM principle: find h ∈ H minimizing the empirical risk
  • i.e. the least error on the training set (see the sketch below)

R_emp(h) = (1/m) Σ_{i=1..m} ℓ(h(xi), ui)

Supervised Learning > 3) Learning > b) Inductive test
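A minimal sketch of the empirical risk R_emp(h) = (1/m) Σ ℓ(h(xi), ui) with the two losses defined earlier (the toy data and the candidate hypotheses are invented for the example):

```python
import numpy as np

def empirical_risk(h, xs, us, loss):
    """Average loss of hypothesis h over the m training examples."""
    return np.mean([loss(h(x), u) for x, u in zip(xs, us)])

def zero_one_loss(pred, u):        # discrimination / classification
    return 0.0 if pred == u else 1.0

def squared_loss(pred, u):         # regression
    return (pred - u) ** 2

# Toy regression problem: H = { h(x) = a*x }, pick the a with least empirical risk.
xs = np.array([0.0, 1.0, 2.0, 3.0])
us = np.array([0.1, 1.9, 4.2, 5.8])
candidates = {a: (lambda x, a=a: a * x) for a in (1.0, 1.5, 2.0, 2.5)}

risks = {a: empirical_risk(h, xs, us, squared_loss) for a, h in candidates.items()}
best_a = min(risks, key=risks.get)
print("empirical risks:", {a: round(r, 3) for a, r in risks.items()})
print("ERM choice: a =", best_a)
```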

40

Learning curve

• Data quantity is important!

[Figure: learning curve, "error" as a function of training set size]

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk

41

Test / Validation

• Measures overfitting / generalization
  Can the acquired knowledge be reused in new circumstances?
  Do NOT validate on the training set!

• Validation on an additional test set

• Cross-validation
  Useful when data is scarce (leave-p-out; see the sketch below)

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
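A minimal sketch of cross-validation, here leave-one-out (leave-p-out with p = 1), on toy data invented for the example: each point is held out once, the model is fit on the rest, and the average held-out error estimates generalization without ever validating on the training set.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 0.9, 2.1, 2.9, 4.2, 4.8])

def loo_cv_error(degree):
    """Leave-one-out cross-validation error of a polynomial fit of given degree."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                   # hold example i out
        coeffs = np.polyfit(x[keep], y[keep], degree)   # fit on the rest
        pred = np.polyval(coeffs, x[i])
        errors.append((pred - y[i]) ** 2)
    return np.mean(errors)

for degree in (1, 2, 4):
    print(f"degree {degree}: leave-one-out error {loo_cv_error(degree):.3f}")
# The degree with the lowest cross-validation error is the safer choice
# when there is too little data for a separate test set.
```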

42

Overfitting

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk

[Figure: real risk vs. empirical risk curves as data quantity varies; the gap between them corresponds to overfitting]

43

Regularization

• Limit overfitting before measuring it on the test set

• Add a penalization term to the inductive test, e.g.:
  • Penalize large numbers (e.g. many or large parameters)
  • Penalize resource use
  • …
  (see the ridge-regression sketch below)

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
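One standard way of adding such a penalization is ridge regression: minimize the empirical risk plus λ times the squared size of the parameters. A minimal sketch (NumPy closed-form solution; the noisy toy data are generated for the example):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||² + lam * ||w||²  (penalizes large parameter values)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy problem: degree-5 polynomial features on 10 noisy points.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
X = np.vander(x, 6)                      # columns x^5, ..., x, 1

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:g}: largest |parameter| = {np.max(np.abs(w)):.2f}")
# A larger lambda shrinks the parameters, giving a smoother, less overfit hypothesis.
```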

44

Maximum a posteriori

• Bayesian approach

• We suppose there exists a prior probability distribution over the hypothesis space H: p_H(h)

• Maximum A Posteriori (MAP) principle:
  Search for the most probable h after having observed the data S

• Ex: observing sheep colors; h = "A sheep is white"

Supervised Learning > 3) Learning > b) Inductive test

45

Minimum Description Length Principle

• Occam's razor: "prefer the simplest hypotheses"

• Simplicity: the size of h → maximum compression

• Maximum a posteriori with p_H(h) = 2^(-d(h))
  • d(h): length of h in bits

• Compression → generalization (see the sketch below)

Supervised Learning > 3) Learning > b) Inductive test
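A minimal sketch of the MDL idea (all numbers are invented for illustration): with the prior p_H(h) = 2^(-d(h)), maximizing the posterior amounts to minimizing the total description length d(h) + (-log2 p(S | h)), i.e. the bits needed to describe the hypothesis plus the bits needed to encode the data given that hypothesis.

```python
import math

# Hypothetical candidates: (name, d(h) in bits, probability the hypothesis
# assigns to the observed data S).  All numbers are invented for illustration.
candidates = [
    ("very simple h", 10, 1e-30),   # short, but explains the data very poorly
    ("medium h",      40, 1e-4),    # reasonable trade-off
    ("very rich h",  400, 1e-2),    # explains the data well, but very long
]

def total_description_length(d_h, p_data_given_h):
    # -log2 p(S | h): number of bits needed to encode the data using h
    return d_h - math.log2(p_data_given_h)

for name, d_h, p in candidates:
    bits = total_description_length(d_h, p)
    print(f"{name:14s}: d(h) + d(S|h) = {bits:.1f} bits")
# MDL (equivalently, MAP with the 2^-d(h) prior) selects the candidate
# with the fewest total bits, here the medium-size hypothesis.
```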

46

c) Running the learning program

• Search for h

• Use the examples of the training set
  One by one
  All together

• Minimize the inductive test

Supervised Learning > 3) Learning

47

Finding the parameters of the model

• Explore the hypothesis space H
  Which hypothesis is best given the inductive test?
  Fundamentally depends on H

a) Structured exploration

b) Local exploration

c) No exploration

Supervised Learning > 3) Learning > c) Running the program

48

Structured exploration

• Structured by a generality relation (partial order)
  Version space
  ILP (Inductive Logic Programming)
  EBL (Explanation-Based Learning)
  Grammatical inference
  Program enumeration

[Figure: hypotheses hi and hj in H, with gms(hi, hj) and smg(hi, hj)]

Supervised Learning > 3) Learning > c) Running the program > Exploring H

49

Representation of the version space

Structured by:

Upper bound: G-set

Lower bound: S-set

• G-set = Set of all most general hypotheses

consistent with known examples

• S-set = Set of all most specific hypotheses

consistent with known examples

[Figure: the version space in H, bounded above by the G-set and below by the S-set, containing hypotheses hi and hj]

Supervised Learning > 3) Learning > c) Running the program > Exploring H

50

Learning…

… by iterated updates of the version space

Idea:

update S-set

and G-set

after each new example

Candidate elimination algorithm

Example: rectangles (cf. blackboard…)

Supervised Learning > 3-c) > Exploring H > Version space

51

Candidate Elimination algorithm

Initialize S (resp. G):

Set of most specific (resp. general), consistent with 1st example

For each new example (+ or -)

update S

update G

Until convergence

or until S = G = Ø

Supervised Learning > 3-c) > Exploring H > Version space

54

Updating S and G: xi is positive

• Updating S
  Generalize the hypotheses in S not covering xi, just enough to cover it
  Then eliminate the hypotheses in S
  • covering one or more negative examples
  • more general than another hypothesis in S

• Updating G
  Eliminate the hypotheses in G not covering xi

Supervised Learning > 3-c) > Exploring H > Version space

55

Updating S and G: xi is negative

• Updating S
  Eliminate the hypotheses in S (wrongly) covering xi

• Updating G
  Specialize the hypotheses in G covering xi

just enough not to cover it

Then eliminate hypotheses in G

• not more general than at least one element of S

• more specific than at least another hypothesis of G

Supervised Learning > 3-c) > Exploring H > Version space
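A minimal sketch of candidate elimination for a simple conjunctive hypothesis space (attribute vectors with a '?' wildcard, in the spirit of Mitchell's classic formulation). This is an illustrative simplification, not the rectangle example from the blackboard: with conjunctions the S-set reduces to a single most-specific hypothesis, and the elimination steps that compare hypotheses inside S and G are omitted.

```python
# Hypotheses are tuples of attribute values; '?' matches anything.

def covers(h, x):
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def generalize(s, x):
    """Minimal generalization of s so that it covers the positive example x."""
    return tuple(sv if sv == xv else '?' for sv, xv in zip(s, x))

def specializations(g, s, x):
    """Minimal specializations of g that reject the negative example x
    while remaining more general than (or equal to) the S hypothesis s."""
    return [g[:i] + (s[i],) + g[i + 1:]
            for i, gv in enumerate(g)
            if gv == '?' and s[i] != '?' and s[i] != x[i]]

def candidate_elimination(examples):
    positives = [x for x, label in examples if label == '+']
    s = positives[0]                        # most specific, consistent with 1st +
    g_set = [tuple('?' for _ in s)]         # most general hypothesis
    for x, label in examples:
        if label == '+':
            s = generalize(s, x)
            g_set = [g for g in g_set if covers(g, x)]
        else:
            g_set = [h for g in g_set
                     for h in (specializations(g, s, x) if covers(g, x) else [g])]
    return s, g_set

# Toy data: (sky, temperature, humidity) -> enjoy sport?
examples = [
    (('sunny', 'warm', 'normal'), '+'),
    (('sunny', 'warm', 'high'),   '+'),
    (('rainy', 'cold', 'high'),   '-'),
]
s, g = candidate_elimination(examples)
print("S:", s)   # ('sunny', 'warm', '?')
print("G:", g)   # [('sunny', '?', '?'), ('?', 'warm', '?')]
```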

56

Candidate Elimination Algorithm

Updating S and G

[Figure: successive updates of the S-set and G-set in H as positive and negative examples (a)-(d') arrive]

Supervised Learning > 3-c) > Exploring H > Version space

57

Local exploration

• Only a neighborhood notion in H: "gradient" methods
  • Neural networks
  • SVMs
  • Simulated annealing / simulated evolution

• /!\ Local Minima

Supervised Learning > 3) Learning > c) Running the program > Exploring H


58

Exploration without hypothesis space

• No hypothesis space: use the examples directly

• ... and the example space: nearest-neighbor methods
  (case-based reasoning / instance-based learning)
  Notion of distance

• Example: k nearest neighbors
  Optional: vote weighted by distance (see the sketch below)

Supervised Learning > 3) Learning > c) Running the program > Exploring H
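A minimal sketch of k nearest neighbors with the optional distance-weighted vote (plain Python standard library; the 2-D toy points are invented for the example):

```python
import math
from collections import defaultdict

def knn_predict(query, examples, k=3, weighted=True):
    """examples: list of (point, label).  Vote among the k closest points,
    optionally weighted by 1/distance; the distance is the inductive bias."""
    by_distance = sorted((math.dist(query, x), u) for x, u in examples)
    votes = defaultdict(float)
    for d, u in by_distance[:k]:
        votes[u] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

examples = [((0.0, 0.0), '-'), ((0.2, 0.1), '-'), ((1.0, 1.0), '+'),
            ((0.9, 1.2), '+'), ((1.1, 0.8), '+')]
print(knn_predict((0.8, 0.9), examples, k=3))   # '+'
```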

59

Inductive bias

• A priori preference for some hypotheses
  Depends on H
  Depends on the search algorithm

• Whatever the inductive test:
  ERM: implicit in H
  MAP: explicit, chosen by the user
  MDL: explicit (length in bits)
  k-NN: the notion of distance

• What justification?

Supervised Learning

Supervised Learning

Less frequent learning types

61

Incremental Learning

• Examples are given/taken one after the other
  Incremental update of the best hypothesis
  Use the acquired knowledge to
  • learn better
  • learn faster

• Data is no longer i.i.d.!
  i.i.d.: independently and identically distributed
  = sampled independently from a non-changing example generator
  Dependence on time / on the sequence

• Ex: mobile phone users' preferences, learning to program, …

Supervised Learning

62

Active Learning

• Set of unlabeled examples
  Labeling an example is expensive
  Choose an example to be labeled: how to choose?

• Data is not i.i.d.

• Ex: video sequence labeling

Supervised Learning

Other types of Machine Learning

Reinforcement Learning

Unsupervised Learning

64

Reinforcement Learning

• Pavlov
  Bell: trigger
  Dog bowl: reward
  Salivating: action
  Association bell ↔ bowl
  Reinforcement of "salivation"

[Figure: agent-environment loop: perception, action, reward/punishment]

• Control behavior with rewards / punishments

65

Reinforcement Learning

• The agent must discover the right behavior
  And optimize it: maximize the expected reward

s_t: state at time t

Action selection: a_t := argmax_a Q(s_t, a)

• Updating the values (r_t: reward received at time t):
  Q(s_t, a_t) ← α Q(s_t, a_t) + (1 - α) [ r_{t+1} + γ max_a Q(s_{t+1}, a) ]
  (see the sketch below)
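A minimal sketch of the value update above: tabular Q-learning on a tiny invented chain environment (states 0..4, reward 1 on reaching state 4), with a little ε-greedy exploration added so the argmax action selection does not get stuck. The environment and all constants are illustrative assumptions.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.5, 0.9, 0.1
actions = ['left', 'right']
Q = defaultdict(float)                       # Q[(state, action)], defaults to 0

def step(state, action):
    """Invented chain world: move left/right on states 0..4, reward 1 at state 4."""
    nxt = min(state + 1, 4) if action == 'right' else max(state - 1, 0)
    return nxt, (1.0 if nxt == 4 else 0.0)

for episode in range(200):
    s = 0
    while s != 4:
        # epsilon-greedy version of a_t := argmax_a Q(s_t, a)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, act)] for act in actions)
        # Value update: Q(s,a) <- alpha*Q(s,a) + (1-alpha)*(r + gamma*max_a' Q(s',a'))
        Q[(s, a)] = alpha * Q[(s, a)] + (1 - alpha) * (r + gamma * best_next)
        s = s_next

greedy_policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(4)}
print(greedy_policy)    # after learning, 'right' should dominate
```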

66

Unsupervised Learning

• No class, no output, no reward
• Goal: group similar examples together

• Notion of distance
• Inductive bias
  (see the sketch below)
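A minimal sketch of the grouping idea, using a bare-bones k-means (one common distance-based clustering method; the 1-D toy data are invented for the example). No labels are used anywhere: only the notion of distance decides the groups.

```python
import numpy as np

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Group examples around k centers using only distances (no labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (the distance is the bias)
        assign = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        centers = np.array([points[assign == j].mean() for j in range(k)])
    return assign, centers

points = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])
assign, centers = kmeans_1d(points, k=2)
print("cluster of each point:", assign)
print("cluster centers:", centers)   # two groups, around ~1 and ~5
```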

67

Conclusion

• Induction: find a general hypothesis from examples

• Avoid overfitting

• Choose the right hypothesis space
  Not too small (bad induction)
  Not too large (overfitting)

• Use an algorithm adequate for the data and the hypothesis space

68

What to remember

• Supervised learning is the most studied setting

• Learning is always biased

• Learning depends on the structure of the hypothesis space
  No structure: interpolation methods

Local structure: gradient methods (approximation)

Partial order relation: guided exploration (exploration)
