UNIT-4 Machine Learning
What is Machine Learning?
- Adapt to / learn from data
- To optimize a performance function
Can be used to:
- Extract knowledge from data
- Learn tasks that are difficult to formalise
- Create software that improves over time
When to learn:
- Human expertise does not exist (navigating on Mars)
- Humans are unable to explain their expertise (speech recognition)
- The solution changes in time (routing on a computer network)
- The solution needs to be adapted to particular cases (user biometrics)
Learning involves:
- Learning general models from data
- Data is cheap and abundant; knowledge is expensive and scarce
- Example: customer transactions to consumer behaviour
- Building a model that is a good and useful approximation to the data
Applications:
- Speech and hand-writing recognition
- Autonomous robot control
- Data mining and bioinformatics: motifs, alignment, …
- Playing games
- Fault detection
- Clinical diagnosis
- Spam email detection
- Credit scoring, fraud detection
- Web mining: search engines
- Market basket analysis
Applications are diverse but the methods are generic.
Generic methods
- Learning from labelled data (supervised learning): e.g. classification, regression, prediction, function approximation
- Learning from unlabelled data (unsupervised learning): e.g. clustering, visualisation, dimensionality reduction
- Learning from sequential data: e.g. speech recognition, DNA data analysis
- Associations
- Reinforcement learning
Statistical Learning
Machine learning methods can be unified within the framework of statistical learning:
- Data is considered to be a sample from a probability distribution.
- Typically, we don't expect perfect learning but only "probably correct" learning.
- Statistical concepts are the key to measuring our expected performance on novel problem instances.
Induction and inference
Induction: Generalizing from specific examples.
Inference: Drawing conclusions from possibly incomplete knowledge.
Learning machines need to do both.
Inductive learning
- Data is produced by a "target".
- A hypothesis is learned from the data in order to "explain", "predict", "model" or "control" the target.
- Generalisation ability is essential.
- The inductive learning hypothesis: "If the hypothesis works for enough data then it will work on new examples."
Example 1: Hand-written digits
Data representation: greyscale images
Task: classification (0, 1, 2, …, 9)
Problem features: highly variable inputs from the same class, including some "weird" inputs; imperfect human classification; high cost associated with errors, so "don't know" may be useful.
Example 2: Speech recognition
Data representation: features from spectral analysis of speech signals (two in this simple example).
Problem features: highly variable data with the same classification; good feature selection is very important. Speech recognition is often broken into a number of smaller tasks like this.
Example 3: DNA microarrays
DNA from ~10000 genes attached to a glass slide (the microarray).
Green and red labels attached to mRNA from two different samples.
mRNA is hybridized (stuck) to the DNA on the chip and green/red ratio is used to measure relative abundance of gene products.
DNA microarrays
Data representation: ~10000 green/red intensity levels, ranging from 10 to 10000.
Tasks: Sample classification, gene classification, visualisation and clustering of genes/samples.
Problem features: High-dimensional data but relatively small number of examples. Extremely noisy data (noise ~ signal). Lack of good domain knowledge.
Projection of 10000 dimensional data onto 2D using PCA effectively separates cancer subtypes.
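The slides do not include code for this, but as a rough illustration, such a projection can be computed with scikit-learn's PCA; the data matrix below is randomly generated and its shape (40 samples by 10000 genes) is only indicative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical expression matrix: 40 samples x 10000 genes.
X = np.random.default_rng(0).normal(size=(40, 10000))

# Project each sample onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (40, 2) -- ready for a 2-D scatter plot by subtype
```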
Probabilistic models
A large part of the module will deal with methods that have an explicit probabilistic interpretation:
- Good for dealing with uncertainty, e.g. is a handwritten digit a three or an eight?
- Provides interpretable results
- Unifies methods from different fields
Face Detection
1. Image pyramid used to locate faces of different sizes
2. Image lighting compensation
3. Neural network detects rotation of the face candidate
4. Final face candidate de-rotated, ready for detection
Face Detection (Cont'd)
5. Submit image to the neural network
   a. Break the image into segments
   b. Each segment is a unique input to the network
   c. Each segment looks for certain patterns (eyes, mouth, etc.)
6. Output is the likelihood of a face
Supervised Learning: Uses
- Prediction of future cases
- Knowledge extraction
- Compression of data and knowledge
Unsupervised Learning
Clustering: grouping similar instances. Example applications:
- Customer segmentation in CRM
- Learning patterns in bioinformatics
- Clustering items based on similarity
- Clustering users based on interests
Reinforcement Learning
- Learning a policy: a sequence of outputs
- No supervised output, but delayed reward
- Credit assignment problem
- Examples: game playing, a robot in a maze
- Multiple agents, partial observability
ID3 Decision Tree
ID3 is particularly interesting for:
- Its representation of learned knowledge
- Its approach to the management of complexity
- Its heuristic for selecting candidate concepts
- Its potential for handling noisy data
ID3 Decision Tree
[The table of 14 credit-risk training examples (properties: credit history, debt, collateral, income; classes: high, moderate, low risk) is not reproduced in this extract.]
The previous table can be represented as the following decision tree:
ID3 Decision Tree
In a decision tree:
- Each internal node represents a test on some property
- Each possible value of that property corresponds to a branch of the tree
- Leaf nodes represent classifications, such as low or moderate risk
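To make this structure concrete, here is a minimal sketch (not from the slides) of how such a tree classifies an example, using a nested-dict representation; the tree fragment and its branch values are illustrative:

```python
def classify(tree, example):
    """Follow one branch per property test until a leaf (a class) is reached."""
    while isinstance(tree, dict):          # internal node: a property test
        prop, branches = next(iter(tree.items()))
        tree = branches[example[prop]]     # the branch for this value
    return tree                            # leaf node: the classification

# A hand-built fragment of a credit-risk tree (structure illustrative):
tree = {"income": {"$0-15k": "high",
                   "$15-35k": {"credit history": {"bad": "high",
                                                  "unknown": "moderate"}},
                   "over $35k": "low"}}
print(classify(tree, {"income": "$15-35k", "credit history": "bad"}))  # high
```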
ID3 Decision Tree
A simplified decision tree for credit risk management
ID3 Decision Tree
- ID3 constructs decision trees in a top-down fashion.
- ID3 selects a property to test at the current node of the tree and uses this test to partition the set of examples.
- The algorithm recursively constructs a sub-tree for each partition.
- This continues until all members of the partition are in the same class.
ID3 Decision Tree
For example, ID3 selects income as the root property for the first step
ID3 Decision Tree
How is the first node (and each subsequent node) selected?
- ID3 measures the information gained by making each property the root of the current subtree.
- It picks the property that provides the greatest information gain.
ID3 Decision Tree
If we assume that all the examples in the table occur with equal probability, then:
- P(risk is high) = 6/14
- P(risk is moderate) = 3/14
- P(risk is low) = 5/14
ID3 Decision Tree
Based on the general formula

$$I(M) = -\sum_{i=1}^{n} p(m_i)\,\log_2 p(m_i)$$

we get

$$\mathrm{Info}(D) = I(6,3,5) = -\frac{6}{14}\log_2\frac{6}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.531$$
ID3 Decision Tree
The information gain for income is:
Gain(income) = I(6,3,5) − E[income] = 1.531 − 0.564 = 0.967
Similarly:
- Gain(credit history) = 0.266
- Gain(debt) = 0.063
- Gain(collateral) = 0.206
ID3 Decision Tree
Since income provides the greatest information gain, ID3 will select it as the root of the tree
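As a sanity check (not part of the original slides), the numbers above can be reproduced in a few lines of Python; the per-branch class counts for the income partition are reconstructed so that they match the slide's E[income] = 0.564:

```python
from math import log2

def entropy(counts):
    """I(c1,...,cn): entropy of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy([6, 3, 5]))  # ~1.531 = I(6,3,5)

# Income splits the 14 examples into three branches; each inner list is
# the (high, moderate, low) count within one branch (reconstructed).
partitions = [[4, 0, 0], [2, 2, 0], [0, 1, 5]]
remainder = sum(sum(p) / 14 * entropy(p) for p in partitions)  # ~0.564
print(round(entropy([6, 3, 5]) - remainder, 3))  # 0.967 = Gain(income)
```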
ID3 Decision Tree Pseudo Code
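The pseudo code image from the original slide is not reproduced here; the following is a minimal Python sketch of the top-down procedure described above (the dict-based example records and the `target` attribute name are assumptions):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(examples, prop, target):
    """Information gain from partitioning the examples on `prop`."""
    labels = [e[target] for e in examples]
    g = entropy(labels)
    for v in {e[prop] for e in examples}:
        sub = [e[target] for e in examples if e[prop] == v]
        g -= len(sub) / len(labels) * entropy(sub)
    return g

def id3(examples, properties, target="risk"):
    """Top-down induction: pick the highest-gain property and recurse."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:              # all members in the same class
        return labels[0]
    if not properties:                     # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(properties, key=lambda p: gain(examples, p, target))
    return {best: {v: id3([e for e in examples if e[best] == v],
                          [p for p in properties if p != best], target)
                   for v in {e[best] for e in examples}}}
```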
Unsupervised Learning
- The learning algorithms discussed so far implement forms of supervised learning.
- They assume the existence of a teacher, some fitness measure, or another external method of classifying training instances.
- Unsupervised learning eliminates the teacher and requires that the learners form and evaluate concepts on their own.
Unsupervised Learning
Science is perhaps the best example of unsupervised learning in humans
Scientists do not have the benefit of a teacher. Instead, they propose hypotheses to explain observations and then evaluate them.
Unsupervised Learning
- The result of this (bottom-up, hierarchical) clustering algorithm is a binary tree whose leaf nodes are instances and whose internal nodes are clusters of increasing size (a minimal sketch follows).
- We may also extend this algorithm to objects represented as sets of symbolic features.
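The algorithm itself is not shown in this extract; a rough bottom-up sketch (single linkage, with an assumed pointwise distance function d) that produces such a binary tree might look like:

```python
import itertools

def leaves(tree):
    """All instances under a node of the cluster tree."""
    if isinstance(tree, tuple):
        return [x for child in tree for x in leaves(child)]
    return [tree]

def agglomerate(items, d):
    """Repeatedly merge the two closest clusters; each merge adds an
    internal node, so the result is a binary tree over the instances."""
    link = lambda a, b: min(d(x, y) for x in leaves(a) for y in leaves(b))
    clusters = list(items)                 # start: every instance is a cluster
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: link(*pair))
        clusters.remove(a); clusters.remove(b)
        clusters.append((a, b))            # merged pair = internal tree node
    return clusters[0]

print(agglomerate([0.0, 0.2, 5.0], lambda x, y: abs(x - y)))
# -> (5.0, (0.0, 0.2)): the two nearby points merge first
```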
Unsupervised Learning
Object1 = {small, red, rubber, ball}
Object2 = {small, blue, rubber, ball}
Object3 = {large, black, wooden, ball}
This metric would compute the similarity values:
Similarity(object1, object2) = 3/4
Similarity(object1, object3) = 1/4
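A minimal sketch of this metric (the proportion of shared features, assuming all objects have the same number of features):

```python
def similarity(a, b):
    """Fraction of features the two objects share."""
    return len(a & b) / len(a)

object1 = {"small", "red", "rubber", "ball"}
object2 = {"small", "blue", "rubber", "ball"}
object3 = {"large", "black", "wooden", "ball"}
print(similarity(object1, object2))  # 0.75 = 3/4
print(similarity(object1, object3))  # 0.25 = 1/4
```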
Machine Learning
Up till now: how to search or reason using a model.
Machine learning: how to select a model on the basis of data / experience.
- Learning parameters (e.g. probabilities)
- Learning hidden concepts (e.g. clustering)
Classification
In classification, we learn to predict labels (classes) for inputs
Examples:
- Spam detection (input: document; classes: spam / ham)
- OCR (input: images; classes: characters)
- Medical diagnosis (input: symptoms; classes: diseases)
- Automatic essay grading (input: document; classes: grades)
- Fraud detection (input: account activity; classes: fraud / no fraud)
- Customer service email routing
- … many more
Classification is an important commercial technology!
Classification
Data: inputs x, class labels y.
- We imagine that x is something with a lot of structure, like an image or a document.
- In the basic case, y is a simple N-way choice.
Basic setup:
- Training data: D = a set of <x, y> pairs
- Feature extractors: functions fi which provide attributes of an example x
- Test data: more x's, for which we must predict y's
Bayes Nets for Classification
One method of classification:
- Features are values for observed variables.
- Y is a query variable.
- Use probabilistic inference to compute the most likely Y.
Simple Classification
Simple example: two binary features. This is a naïve Bayes model.
[Figure: naïve Bayes network with class variable M and two features S and F, comparing the direct estimate, the Bayes estimate (no assumptions), and the conditional-independence decomposition.]
General Naïve Bayes
A general naïve Bayes model: a class variable C with evidence variables E1, E2, …, En.
- Prior P(C): |C| parameters
- Conditionals P(Ei | C), i = 1…n: n × |E| × |C| parameters
- By contrast, the full joint distribution needs |C| × |E|^n parameters
Inference for Naïve Bayes
Goal: compute the posterior over causes.
- Step 1: get the joint probability of causes and evidence.
- Step 2: get the probability of the evidence.
- Step 3: renormalize.
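A minimal sketch of these three steps, assuming the class prior and the per-class likelihoods of the observed evidence are given as dicts (the example classes and numbers are hypothetical):

```python
from math import prod

def naive_bayes_posterior(prior, likelihoods):
    """Posterior P(C | e1..en) for a naive Bayes model.

    prior:       maps each class c to P(c)
    likelihoods: maps each class c to [P(e1|c), ..., P(en|c)]
                 for the actually observed evidence values
    """
    # Step 1: joint probability of each cause with the evidence.
    joint = {c: prior[c] * prod(likelihoods[c]) for c in prior}
    # Step 2: probability of the evidence (sum over all causes).
    p_evidence = sum(joint.values())
    # Step 3: renormalize.
    return {c: p / p_evidence for c, p in joint.items()}

# Hypothetical two-class, two-feature example:
print(naive_bayes_posterior({"spam": 0.5, "ham": 0.5},
                            {"spam": [0.8, 0.1], "ham": [0.3, 0.4]}))
```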
A Digit Recognizer
Input: pixel grids
Output: a digit 0-9
Examples: CPTs. The table gives the prior P(Y) and the conditional probabilities P(F = on | Y) for two example pixel features (the labels F1 and F2 are added here):

Y    P(Y)   P(F1=on|Y)   P(F2=on|Y)
1    0.1    0.01         0.05
2    0.1    0.05         0.01
3    0.1    0.05         0.90
4    0.1    0.30         0.80
5    0.1    0.80         0.90
6    0.1    0.90         0.90
7    0.1    0.05         0.25
8    0.1    0.60         0.85
9    0.1    0.50         0.60
0    0.1    0.80         0.80
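Plugging these CPT values into the `naive_bayes_posterior` sketch above, assuming both example pixels are observed "on":

```python
prior = {d: 0.1 for d in "1234567890"}
likelihoods = {"1": [0.01, 0.05], "2": [0.05, 0.01], "3": [0.05, 0.90],
               "4": [0.30, 0.80], "5": [0.80, 0.90], "6": [0.90, 0.90],
               "7": [0.05, 0.25], "8": [0.60, 0.85], "9": [0.50, 0.60],
               "0": [0.80, 0.80]}
posterior = naive_bayes_posterior(prior, likelihoods)
print(max(posterior, key=posterior.get))  # '6' is the most likely digit
```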
Parameter Estimation
Estimating the distribution of a random variable X or X|Y:
- Empirically: use training data. For each value x, look at the empirical rate of that value, P(x) = count(x) / (number of samples); this estimate maximizes the likelihood of the data.
- Elicitation: ask a human! This usually needs domain experts and sophisticated ways of eliciting probabilities (e.g. betting games), and calibration is troublesome.
[Example: from the observed samples r, g, g, the empirical estimates are P(r) = 1/3 and P(g) = 2/3.]
Handwritten characters classification
Gray level pictures: object classification
Gray level pictures: human action classification
Expectation Maximization (EM)
When to use:
- Data is only partially observable
- Unsupervised clustering: the target value is unobservable
- Supervised learning: some instance attributes are unobservable
Applications:
- Training Bayesian belief networks
- Unsupervised clustering
- Learning hidden Markov models
Generating Data from Mixture of Gaussians
Each instance x is generated by:
1. Choosing one of the k Gaussians at random
2. Generating an instance according to that Gaussian
EM for Estimating k Means
Given: instances from X generated by a mixture of k Gaussians
- The means <m1, …, mk> of the k Gaussians are unknown
- We don't know which instance xi was generated by which Gaussian
Determine: maximum likelihood estimates of <m1, …, mk>
- Think of the full description of each instance as yi = <xi, zi1, zi2>
- zij is 1 if xi was generated by the j-th Gaussian
- xi is observable; zij is unobservable
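A rough NumPy sketch of this procedure for 1-D data with a known, shared variance; the initialisation scheme and the demo data are illustrative:

```python
import numpy as np

def em_k_means(x, k, sigma=1.0, iters=50, seed=0):
    """EM for the means of k equal-variance 1-D Gaussians.
    x: observed instances; the assignments z_ij stay hidden."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)          # initial mean guesses
    for _ in range(iters):
        # E-step: responsibilities E[z_ij] under the current means.
        resp = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each mean becomes a responsibility-weighted average.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_k_means(x, k=2))  # means close to 0 and 5
```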
EM Algorithm
- EM converges to a local maximum of the likelihood and provides estimates of the hidden variables zij.
- In fact, it finds a local maximum of E[ln P(Y|h)], where Y is the complete data (observable plus unobservable variables).
- The expected value is taken over the possible values of the unobserved variables in Y.
General EM Problem
Given:
- Observed data X = {x1, …, xm}
- Unobserved data Z = {z1, …, zm}
- A parameterized probability distribution P(Y|h), where Y = {y1, …, ym} is the full data, yi = <xi, zi>, and h are the parameters
Determine:
- h that (locally) maximizes E[ln P(Y|h)]
Applications: training Bayesian belief networks, unsupervised clustering, hidden Markov models
General EM Method
Define a likelihood function Q(h'|h) which calculates Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:
Q(h'|h) = E[ln P(Y|h') | h, X]
EM algorithm:
- Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.
- Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax_{h'∈H} Q(h'|h)
Thank You