UNIT-4 Machine Learning
What is Machine Learning?
- Adapt to / learn from data
- To optimize a performance function
Can be used to:
- Extract knowledge from data
- Learn tasks that are difficult to formalise
- Create software that improves over time
When to learn:
- Human expertise does not exist (navigating on Mars)
- Humans are unable to explain their expertise (speech recognition)
- The solution changes in time (routing on a computer network)
- The solution needs to be adapted to particular cases (user biometrics)
Learning involves:
- Learning general models from data
- Data is cheap and abundant; knowledge is expensive and scarce
- Example: customer transactions to consumer behaviour
- Building a model that is a good and useful approximation to the data
Applications:
- Speech and hand-writing recognition
- Autonomous robot control
- Data mining and bioinformatics: motifs, alignment, …
- Playing games
- Fault detection
- Clinical diagnosis
- Spam email detection
- Credit scoring, fraud detection
- Web mining: search engines
- Market basket analysis
Applications are diverse but the methods are generic.
Generic methods
- Learning from labelled data (supervised learning): e.g. classification, regression, prediction, function approximation
- Learning from unlabelled data (unsupervised learning): e.g. clustering, visualisation, dimensionality reduction
- Learning from sequential data: e.g. speech recognition, DNA data analysis
- Associations
- Reinforcement learning
Statistical Learning
Machine learning methods can be unified within the framework of statistical learning:
- Data is considered to be a sample from a probability distribution.
- Typically, we don't expect perfect learning but only "probably correct" learning.
- Statistical concepts are the key to measuring our expected performance on novel problem instances.
Induction and inference
Induction: Generalizing from specific examples.
Inference: Drawing conclusions from possibly incomplete knowledge.
Learning machines need to do both.
Inductive learning
- Data is produced by a "target".
- A hypothesis is learned from the data in order to "explain", "predict", "model" or "control" the target.
- Generalisation ability is essential.
- The inductive learning hypothesis: "If the hypothesis works for enough data then it will work on new examples."
Example 1: Hand-written digits
Data representation: greyscale images
Task: classification (0, 1, 2, …, 9)
Problem features: highly variable inputs from the same class, including some "weird" inputs; imperfect human classification; high cost associated with errors, so "don't know" may be useful.
Example 2: Speech recognition
Data representation: features from spectral analysis of speech signals (two in this simple example).
Problem features: highly variable data with the same classification; good feature selection is very important. Speech recognition is often broken into a number of smaller tasks like this.
Example 3: DNA microarrays
DNA from ~10000 genes attached to a glass slide (the microarray).
Green and red labels attached to mRNA from two different samples.
mRNA is hybridized (stuck) to the DNA on the chip and green/red ratio is used to measure relative abundance of gene products.
DNA microarrays
Data representation: ~10000 green/red intensity levels, ranging from 10 to 10000.
Tasks: Sample classification, gene classification, visualisation and clustering of genes/samples.
Problem features: High-dimensional data but relatively small number of examples. Extremely noisy data (noise ~ signal). Lack of good domain knowledge.
Projection of 10000 dimensional data onto 2D using PCA effectively separates cancer subtypes.
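The slides do not include code for this, but as a rough illustration, such a projection can be computed with scikit-learn's PCA; the data matrix below is randomly generated and its shape (40 samples by 10000 genes) is only indicative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical expression matrix: 40 samples x 10000 genes.
X = np.random.default_rng(0).normal(size=(40, 10000))

# Project each sample onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (40, 2) -- ready for a 2-D scatter plot by subtype
```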
Probabilistic models
A large part of the module will deal with methods that have an explicit probabilistic interpretation:
- Good for dealing with uncertainty, e.g. is a handwritten digit a three or an eight?
- Provides interpretable results
- Unifies methods from different fields
Face Detection
1. Image pyramid used to locate faces of different sizes
2. Image lighting compensation
3. Neural network detects rotation of the face candidate
4. Final face candidate de-rotated, ready for detection
Face Detection (Cont'd)
5. Submit image to the neural network
   a. Break the image into segments
   b. Each segment is a unique input to the network
   c. Each segment looks for certain patterns (eyes, mouth, etc.)
6. Output is the likelihood of a face
Supervised Learning: Uses
- Prediction of future cases
- Knowledge extraction
- Compression of data and knowledge
Unsupervised Learning
Clustering: grouping similar instances. Example applications:
- Customer segmentation in CRM
- Learning patterns in bioinformatics
- Clustering items based on similarity
- Clustering users based on interests
Reinforcement Learning
- Learning a policy: a sequence of outputs
- No supervised output, but delayed reward
- Credit assignment problem
- Examples: game playing, a robot in a maze
- Multiple agents, partial observability
ID3 Decision Tree
ID3 is particularly interesting for:
- Its representation of learned knowledge
- Its approach to the management of complexity
- Its heuristic for selecting candidate concepts
- Its potential for handling noisy data
ID3 Decision Tree
[The table of 14 credit-risk training examples (properties: credit history, debt, collateral, income; classes: high, moderate, low risk) is not reproduced in this extract.]
The previous table can be represented as the following decision tree:
ID3 Decision Tree
In a decision tree:
- Each internal node represents a test on some property
- Each possible value of that property corresponds to a branch of the tree
- Leaf nodes represent classifications, such as low or moderate risk
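To make this structure concrete, here is a minimal sketch (not from the slides) of how such a tree classifies an example, using a nested-dict representation; the tree fragment and its branch values are illustrative:

```python
def classify(tree, example):
    """Follow one branch per property test until a leaf (a class) is reached."""
    while isinstance(tree, dict):          # internal node: a property test
        prop, branches = next(iter(tree.items()))
        tree = branches[example[prop]]     # the branch for this value
    return tree                            # leaf node: the classification

# A hand-built fragment of a credit-risk tree (structure illustrative):
tree = {"income": {"$0-15k": "high",
                   "$15-35k": {"credit history": {"bad": "high",
                                                  "unknown": "moderate"}},
                   "over $35k": "low"}}
print(classify(tree, {"income": "$15-35k", "credit history": "bad"}))  # high
```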
ID3 Decision Tree
A simplified decision tree for credit risk management
ID3 Decision Tree
- ID3 constructs decision trees in a top-down fashion.
- ID3 selects a property to test at the current node of the tree and uses this test to partition the set of examples.
- The algorithm recursively constructs a sub-tree for each partition.
- This continues until all members of the partition are in the same class.
ID3 Decision Tree
For example, ID3 selects income as the root property for the first step
ID3 Decision Tree
How is the first node (and each subsequent node) selected?
- ID3 measures the information gained by making each property the root of the current subtree.
- It picks the property that provides the greatest information gain.
ID3 Decision Tree
If we assume that all the examples in the table occur with equal probability, then:
- P(risk is high) = 6/14
- P(risk is moderate) = 3/14
- P(risk is low) = 5/14
ID3 Decision Tree
Based on the general formula

$$I(M) = -\sum_{i=1}^{n} p(m_i)\,\log_2 p(m_i)$$

we get

$$\mathrm{Info}(D) = I(6,3,5) = -\frac{6}{14}\log_2\frac{6}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.531$$
ID3 Decision Tree
The information gain for income is:
Gain(income) = I(6,3,5) − E[income] = 1.531 − 0.564 = 0.967
Similarly:
- Gain(credit history) = 0.266
- Gain(debt) = 0.063
- Gain(collateral) = 0.206
ID3 Decision Tree
Since income provides the greatest information gain, ID3 will select it as the root of the tree
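As a sanity check (not part of the original slides), the numbers above can be reproduced in a few lines of Python; the per-branch class counts for the income partition are reconstructed so that they match the slide's E[income] = 0.564:

```python
from math import log2

def entropy(counts):
    """I(c1,...,cn): entropy of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy([6, 3, 5]))  # ~1.531 = I(6,3,5)

# Income splits the 14 examples into three branches; each inner list is
# the (high, moderate, low) count within one branch (reconstructed).
partitions = [[4, 0, 0], [2, 2, 0], [0, 1, 5]]
remainder = sum(sum(p) / 14 * entropy(p) for p in partitions)  # ~0.564
print(round(entropy([6, 3, 5]) - remainder, 3))  # 0.967 = Gain(income)
```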
ID3 Decision Tree Pseudo Code
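The pseudo code image from the original slide is not reproduced here; the following is a minimal Python sketch of the top-down procedure described above (the dict-based example records and the `target` attribute name are assumptions):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(examples, prop, target):
    """Information gain from partitioning the examples on `prop`."""
    labels = [e[target] for e in examples]
    g = entropy(labels)
    for v in {e[prop] for e in examples}:
        sub = [e[target] for e in examples if e[prop] == v]
        g -= len(sub) / len(labels) * entropy(sub)
    return g

def id3(examples, properties, target="risk"):
    """Top-down induction: pick the highest-gain property and recurse."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:              # all members in the same class
        return labels[0]
    if not properties:                     # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(properties, key=lambda p: gain(examples, p, target))
    return {best: {v: id3([e for e in examples if e[best] == v],
                          [p for p in properties if p != best], target)
                   for v in {e[best] for e in examples}}}
```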
Unsupervised Learning
- The learning algorithms discussed so far implement forms of supervised learning.
- They assume the existence of a teacher, some fitness measure, or another external method of classifying training instances.
- Unsupervised learning eliminates the teacher and requires that the learners form and evaluate concepts on their own.
Unsupervised Learning
Science is perhaps the best example of unsupervised learning in humans
Scientists do not have the benefit of a teacher. Instead, they propose hypotheses to explain observations and then evaluate them.
Unsupervised Learning
- The result of this (bottom-up, hierarchical) clustering algorithm is a binary tree whose leaf nodes are instances and whose internal nodes are clusters of increasing size (a minimal sketch follows).
- We may also extend this algorithm to objects represented as sets of symbolic features.
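The algorithm itself is not shown in this extract; a rough bottom-up sketch (single linkage, with an assumed pointwise distance function d) that produces such a binary tree might look like:

```python
import itertools

def leaves(tree):
    """All instances under a node of the cluster tree."""
    if isinstance(tree, tuple):
        return [x for child in tree for x in leaves(child)]
    return [tree]

def agglomerate(items, d):
    """Repeatedly merge the two closest clusters; each merge adds an
    internal node, so the result is a binary tree over the instances."""
    link = lambda a, b: min(d(x, y) for x in leaves(a) for y in leaves(b))
    clusters = list(items)                 # start: every instance is a cluster
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: link(*pair))
        clusters.remove(a); clusters.remove(b)
        clusters.append((a, b))            # merged pair = internal tree node
    return clusters[0]

print(agglomerate([0.0, 0.2, 5.0], lambda x, y: abs(x - y)))
# -> (5.0, (0.0, 0.2)): the two nearby points merge first
```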
Unsupervised Learning
Object1 = {small, red, rubber, ball}
Object2 = {small, blue, rubber, ball}
Object3 = {large, black, wooden, ball}
This metric would compute the similarity values:
Similarity(object1, object2) = 3/4
Similarity(object1, object3) = 1/4
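A minimal sketch of this metric (the proportion of shared features, assuming all objects have the same number of features):

```python
def similarity(a, b):
    """Fraction of features the two objects share."""
    return len(a & b) / len(a)

object1 = {"small", "red", "rubber", "ball"}
object2 = {"small", "blue", "rubber", "ball"}
object3 = {"large", "black", "wooden", "ball"}
print(similarity(object1, object2))  # 0.75 = 3/4
print(similarity(object1, object3))  # 0.25 = 1/4
```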
Machine Learning
Up till now: how to search or reason using a model.
Machine learning: how to select a model on the basis of data / experience.
- Learning parameters (e.g. probabilities)
- Learning hidden concepts (e.g. clustering)
Classification
In classification, we learn to predict labels (classes) for inputs
Examples:
- Spam detection (input: document; classes: spam / ham)
- OCR (input: images; classes: characters)
- Medical diagnosis (input: symptoms; classes: diseases)
- Automatic essay grading (input: document; classes: grades)
- Fraud detection (input: account activity; classes: fraud / no fraud)
- Customer service email routing
- … many more
Classification is an important commercial technology!
Classification
Data: inputs x, class labels y.
- We imagine that x is something with a lot of structure, like an image or a document.
- In the basic case, y is a simple N-way choice.
Basic setup:
- Training data: D = a set of <x, y> pairs
- Feature extractors: functions fi which provide attributes of an example x
- Test data: more x's, for which we must predict y's
Bayes Nets for Classification
One method of classification:
- Features are values for observed variables.
- Y is a query variable.
- Use probabilistic inference to compute the most likely Y.
Simple Classification
Simple example: two binary features. This is a naïve Bayes model.
[Figure: naïve Bayes network with class variable M and two features S and F, comparing the direct estimate, the Bayes estimate (no assumptions), and the conditional-independence decomposition.]
General Naïve Bayes
A general naïve Bayes model: a class variable C with evidence variables E1, E2, …, En.
- Prior P(C): |C| parameters
- Conditionals P(Ei | C), i = 1…n: n × |E| × |C| parameters
- By contrast, the full joint distribution needs |C| × |E|^n parameters
Inference for Naïve Bayes
Goal: compute the posterior over causes.
- Step 1: get the joint probability of causes and evidence.
- Step 2: get the probability of the evidence.
- Step 3: renormalize.
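A minimal sketch of these three steps, assuming the class prior and the per-class likelihoods of the observed evidence are given as dicts (the example classes and numbers are hypothetical):

```python
from math import prod

def naive_bayes_posterior(prior, likelihoods):
    """Posterior P(C | e1..en) for a naive Bayes model.

    prior:       maps each class c to P(c)
    likelihoods: maps each class c to [P(e1|c), ..., P(en|c)]
                 for the actually observed evidence values
    """
    # Step 1: joint probability of each cause with the evidence.
    joint = {c: prior[c] * prod(likelihoods[c]) for c in prior}
    # Step 2: probability of the evidence (sum over all causes).
    p_evidence = sum(joint.values())
    # Step 3: renormalize.
    return {c: p / p_evidence for c, p in joint.items()}

# Hypothetical two-class, two-feature example:
print(naive_bayes_posterior({"spam": 0.5, "ham": 0.5},
                            {"spam": [0.8, 0.1], "ham": [0.3, 0.4]}))
```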
A Digit Recognizer
Input: pixel grids
Output: a digit 0-9
Examples: CPTs. The table gives the prior P(Y) and the conditional probabilities P(F = on | Y) for two example pixel features (the labels F1 and F2 are added here):

Y    P(Y)   P(F1=on|Y)   P(F2=on|Y)
1    0.1    0.01         0.05
2    0.1    0.05         0.01
3    0.1    0.05         0.90
4    0.1    0.30         0.80
5    0.1    0.80         0.90
6    0.1    0.90         0.90
7    0.1    0.05         0.25
8    0.1    0.60         0.85
9    0.1    0.50         0.60
0    0.1    0.80         0.80
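Plugging these CPT values into the `naive_bayes_posterior` sketch above, assuming both example pixels are observed "on":

```python
prior = {d: 0.1 for d in "1234567890"}
likelihoods = {"1": [0.01, 0.05], "2": [0.05, 0.01], "3": [0.05, 0.90],
               "4": [0.30, 0.80], "5": [0.80, 0.90], "6": [0.90, 0.90],
               "7": [0.05, 0.25], "8": [0.60, 0.85], "9": [0.50, 0.60],
               "0": [0.80, 0.80]}
posterior = naive_bayes_posterior(prior, likelihoods)
print(max(posterior, key=posterior.get))  # '6' is the most likely digit
```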
Parameter Estimation
Estimating the distribution of a random variable X or X|Y:
- Empirically: use training data. For each value x, look at the empirical rate of that value, P(x) = count(x) / (number of samples); this estimate maximizes the likelihood of the data.
- Elicitation: ask a human! This usually needs domain experts and sophisticated ways of eliciting probabilities (e.g. betting games), and calibration is troublesome.
[Example: from the observed samples r, g, g, the empirical estimates are P(r) = 1/3 and P(g) = 2/3.]
Handwritten characters classification
Gray level pictures: object classification
Gray level pictures: human action classification
Expectation Maximization (EM)
When to use:
- Data is only partially observable
- Unsupervised clustering: the target value is unobservable
- Supervised learning: some instance attributes are unobservable
Applications:
- Training Bayesian belief networks
- Unsupervised clustering
- Learning hidden Markov models
Generating Data from Mixture of Gaussians
Each instance x is generated by:
1. Choosing one of the k Gaussians at random
2. Generating an instance according to that Gaussian
EM for Estimating k Means
Given: instances from X generated by a mixture of k Gaussians
- The means <m1, …, mk> of the k Gaussians are unknown
- We don't know which instance xi was generated by which Gaussian
Determine: maximum likelihood estimates of <m1, …, mk>
- Think of the full description of each instance as yi = <xi, zi1, zi2>
- zij is 1 if xi was generated by the j-th Gaussian
- xi is observable; zij is unobservable
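A rough NumPy sketch of this procedure for 1-D data with a known, shared variance; the initialisation scheme and the demo data are illustrative:

```python
import numpy as np

def em_k_means(x, k, sigma=1.0, iters=50, seed=0):
    """EM for the means of k equal-variance 1-D Gaussians.
    x: observed instances; the assignments z_ij stay hidden."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)          # initial mean guesses
    for _ in range(iters):
        # E-step: responsibilities E[z_ij] under the current means.
        resp = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each mean becomes a responsibility-weighted average.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_k_means(x, k=2))  # means close to 0 and 5
```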
EM Algorithm
- EM converges to a local maximum of the likelihood and provides estimates of the hidden variables zij.
- In fact, it finds a local maximum of E[ln P(Y|h)], where Y is the complete data (observable plus unobservable variables).
- The expected value is taken over the possible values of the unobserved variables in Y.
General EM Problem
Given:
- Observed data X = {x1, …, xm}
- Unobserved data Z = {z1, …, zm}
- A parameterized probability distribution P(Y|h), where Y = {y1, …, ym} is the full data, yi = <xi, zi>, and h are the parameters
Determine:
- h that (locally) maximizes E[ln P(Y|h)]
Applications: training Bayesian belief networks, unsupervised clustering, hidden Markov models
General EM Method
Define a likelihood function Q(h'|h) which calculates Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:
Q(h'|h) = E[ln P(Y|h') | h, X]
EM algorithm:
- Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.
- Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax_{h'∈H} Q(h'|h)
Thank You