Machine Learning
Dr. G. Bharadwaja Kumar
VIT Chennai
Important developments in Computer Science
1. Bio Metrics: Speaker verification, Face, iris, finger print
2. Finance: Credit scoring, fraud detection
3. Manufacturing: Optimization, troubleshooting
4. Medicine: Clinical diagnosis
5. Telecommunications: Quality of service optimization
6. Stock market forecasting
7. Hand written character recognition
8. Autonomous robot control
9. Spam email detection
10. ...
Dr. G. Bharadwaja Kumar, VIT Chennai 2
Watson - Jeopardy
Speech Recognition
Apple Smart Watch with SIRI; Apple Smart Phone with SIRI
Medical Diagnosis
• Assist in decision making with a large number of inputs and in stressful situations
Medical Diagnosis
ML Application: Loan Approvals
Name         income    debt     married  age   approve/deny
John Smith   200,000   0        yes      80
Peter White  60,000    1,000    no       30
Ann Clark    100,000   10,000   yes      40
Susan Ho     0         20,000   no       25
• Objects – people
• Classes – “approve”, “deny”
Biometrics
What is Artificial Intelligence?
• A branch of computer science which:
  • is the science of making machines do things that would require intelligence if done by men (Minsky)
  • is the exciting new effort to make computers think (Haugeland)
  • is the study of the computations that make it possible to perceive, reason, and act (Winston)
  • is the study of how to do things which at the moment people do better (Rich & Knight)
• The term “Artificial Intelligence” was coined in 1956 by John McCarthy at the Dartmouth conference
• AI is an extensive field of Computer Science
• There are many sub-fields of AI:
  • Machine Learning
  • Natural Language Processing
  • Speech Recognition
  • Computer Vision
Machine Learning : Definition
• “The goal of machine learning is to build computer systems that can adapt and learn from example data and past experience, and optimize their performance.”
  – Tom Dietterich
Other Definitions
❖ Machine learning enables computers to learn and improve automatically using example data or past experience, and to handle new situations.
❖ The field of Machine Learning seeks to answer the question: “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”
What is Machine Learning?
The complexity in traditional computer programming is in the code (programs that people write). In machine learning, algorithms (programs) are in principle simple and the complexity (structure) is in the data. Is there a way that we can automatically learn that structure? That is what is at the heart of machine learning.
-- Andrew Ng
That is, machine learning is about the construction and study of systems that can learn from data. This is very different from traditional computer programming.
What is learning?
• Abstracting and generalizing knowledge or patterns from the data
• Required components:
  • Identifying the exact type of knowledge to be learned
  • A representation for this target knowledge
  • A learning mechanism
Why Machine Learning is Hard
What you see vs. what your ML algorithm sees
Well Posed Learning Problems
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
  • T: the class of tasks that we want the computer program to do
  • P: a measure of performance for how well the computer did
  • E: some experience (training data) the program has with the task
A checkers learning problem
• A checkers learning problem:
  • T: playing checkers
  • P: percent of games won against opponents
  • E: playing practice games against itself
Designing a Learning System
• Consider designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament
• Requires the following design choices:
• Choosing the Training Experience
• Choosing the Target Function
• Choosing the Representation of the Target Function
• Choosing the Function Approximation Algorithm
CS 484 – Artificial Intelligence 27
Choosing the Training Experience (1)
• Will the training experience provide direct or indirect feedback?
  • Direct feedback: the system learns from examples of individual checkers board states and the correct move for each
  • Indirect feedback: move sequences and final outcomes of various games played
    • Credit assignment problem: the value of early states must be inferred from the outcome
• Degree to which the learner controls the sequence of training examples
  • The teacher selects informative boards and gives the correct move
  • The learner proposes board states that it finds particularly confusing; the teacher provides the correct moves
  • The learner controls the board states and (indirect) training classifications
Choosing the Training Experience (2)
• How well does the training experience represent the distribution of examples over which the final system performance P will be measured?
  • If training the checkers program consists only of games played against itself, it may never encounter crucial board states that are likely to be played by the human checkers champion
• Most theory of machine learning rests on the assumption that the distribution of training examples is identical to the distribution of test examples
Partial Design of Checkers Learning Program
• A checkers learning problem:
  • Task T: playing checkers
  • Performance measure P: percent of games won in the world tournament
  • Training experience E: games played against itself
• Remaining choices:
  • The exact type of knowledge to be learned
  • A representation for this target knowledge
  • A learning mechanism
Choosing the Target Function (1)
• Assume that you can determine the legal moves
• The program needs to learn the best move from among the legal moves
  • Defines a large search space known a priori
  • Target function: ChooseMove : B → M
  • ChooseMove is difficult to learn given indirect training
• Alternative target function
  • An evaluation function that assigns a numerical score to any given board state
  • V : B → ℝ (where ℝ is the set of real numbers)
• V(b) for an arbitrary board state b in B:
  • if b is a final board state that is won, then V(b) = 100
  • if b is a final board state that is lost, then V(b) = −100
  • if b is a final board state that is drawn, then V(b) = 0
  • if b is not a final state, then V(b) = V(b′), where b′ is the best final board state that can be achieved starting from b and playing optimally until the end of the game
Choosing the Target Function (2)
• V(b) gives a recursive definition for board state b
  • Not usable because it is not efficient to compute except in the first three trivial cases
  • A nonoperational definition
• The goal of learning is to discover an operational description of V
• Learning the target function is often called function approximation
  • The learned function is referred to as V̂
Choosing a Representation for the Target Function
• The choice of representation involves trade-offs
• Pick a very expressive representation to allow a close approximation to the ideal target function V
  • The more expressive the representation, the more training data is required to choose among alternative hypotheses
• Use a linear combination of the following board features:
  • x1: the number of black pieces on the board
  • x2: the number of red pieces on the board
  • x3: the number of black kings on the board
  • x4: the number of red kings on the board
  • x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
  • x6: the number of red pieces threatened by black

V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
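This linear representation is just a weighted sum, which can be sketched in a few lines of Python; the weights below are made-up values purely to show the arithmetic (in practice they are learned):

```python
# Linear evaluation function V̂(b) = w0 + w1*x1 + ... + w6*x6.
# The weights here are hypothetical; in practice they are learned.

def v_hat(weights, features):
    """weights[0] is the bias w0; features holds x1..x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

w = [0.0, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]   # w0..w6 (illustrative)
b = [3, 0, 1, 0, 0, 0]                       # x1..x6: 3 black pieces, 1 black king
print(v_hat(w, b))                           # 5.0
```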
Partial Design of Checkers Learning Program
• A checkers learning problem:
  • Task T: playing checkers
  • Performance measure P: percent of games won in the world tournament
  • Training experience E: games played against itself
  • Target function: V : Board → ℝ
  • Target function representation:

    V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
Choosing a Function Approximation Algorithm
• To learn V̂ we require a set of training examples, each describing a board state b and its training value Vtrain(b)
  • Ordered pair: ⟨b, Vtrain(b)⟩
  • Example: ⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
Estimating Training Values
• Need to assign specific scores to intermediate board states
• Approximate intermediate board state b using the learner's current approximation of the next board state following b
• Simple and successful approach
• More accurate for states closer to end states
Vtrain(b) ← V̂(Successor(b))
Adjusting the Weights
• Choose the weights wi to best fit the set of training examples
• Minimize the squared error E between the train values and the values predicted by the hypothesis
• Require an algorithm that• will incrementally refine weights as new training examples become
available
• will be robust to errors in these estimated training values
• Least Mean Squares (LMS) is one such algorithm
E ≡ Σ over training examples ⟨b, Vtrain(b)⟩ of (Vtrain(b) − V̂(b))²
LMS Weight Update Rule
• For each training example ⟨b, Vtrain(b)⟩:
  • Use the current weights to calculate V̂(b)
  • For each weight wi, update it as

    wi ← wi + η (Vtrain(b) − V̂(b)) xi

  • where η is a small constant (e.g., 0.1)
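One LMS update step can be sketched as follows; this is a minimal version assuming the bias w0 is updated with a constant feature x0 = 1, and the starting weights and board features are illustrative:

```python
# One LMS weight update: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i

def lms_update(weights, features, v_train, eta=0.1):
    # Current prediction V_hat(b); weights[0] is the bias w0 (with x0 = 1)
    v_hat = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    error = v_train - v_hat
    new_weights = [weights[0] + eta * error]                           # bias term
    new_weights += [w + eta * error * x
                    for w, x in zip(weights[1:], features)]            # w1..w6
    return new_weights

w = [0.0] * 7                 # start with all weights zero
b = [3, 0, 1, 0, 0, 0]        # board features x1..x6
w = lms_update(w, b, v_train=100.0)
print(w)                      # weights for active features move toward the target
```

Repeating this update over many training examples gradually reduces the squared error E defined above.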
Final Design
The final design consists of four interacting modules:
• Experiment Generator: takes the current hypothesis V̂ and outputs a new problem (an initial game board)
• Performance System: plays the game and produces a solution trace (the game history)
• Critic: turns the solution trace into training examples ⟨b1, Vtrain(b1)⟩, ⟨b2, Vtrain(b2)⟩, …
• Generalizer: produces a new hypothesis V̂ from the training examples
Summary of Design Choices
• Determine the type of training experience: games against itself, a table of correct moves, games against experts, …
• Determine the target function: Board → value, Board → move, …
• Determine the representation of the learned function: linear function of six features, polynomial, artificial neural network, …
• Determine the learning algorithm: gradient descent, linear programming, …
• Result: the complete design
Training Classification Problems
• Many learning problems involve classifying inputs into a discrete set of possible categories.
• Learning is only possible if there is a relationship between the data and the classifications.
• Training involves providing the system with data which has been manually classified.
• Learning systems use the training data to learn to classify unseen data.
Rote learning
• A very simple learning method.
• Simply involves memorizing the classifications of the training data.
• Can only classify previously seen data – unseen data cannot be classified by a rote learner.
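A rote learner can be sketched as a plain lookup table; this toy class is purely illustrative:

```python
# A rote learner memorizes training classifications verbatim.
# It can only answer for inputs it has already seen.

class RoteLearner:
    def __init__(self):
        self.memory = {}

    def train(self, example, label):
        self.memory[example] = label      # memorize; no generalization

    def classify(self, example):
        # Unseen data cannot be classified by a rote learner
        return self.memory.get(example, "unknown")

learner = RoteLearner()
learner.train(("red", "rectangle"), "positive")
print(learner.classify(("red", "rectangle")))   # positive
print(learner.classify(("blue", "circle")))     # unknown
```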
• Rote learning is learning without understanding the meaning of what is learned.
• For example, you can learn to make the correct response to a stimulus without discovering the conceptual category to which the stimulus belongs.
• More technically, you make the correct response without detecting the attributes that the stimulus shares with other members of the conceptual class.
• The next time you see that example, you may give the correct response.
• But what if you are given an example you haven’t seen before?
Concept Learning
• Concept learning involves determining a mapping from a set of input variables to a Boolean value.
• Such methods are known as inductive learning methods.
• If a function can be found which maps training data to correct classifications, then it will also work well for unseen data – hopefully!
• This process is known as generalization.
Concept Learning
• Concepts are categories of stimuli that have certain features in common.
• The shapes on the right are all members of a conceptual category: rectangle. Their common features are (1) 4 lines; (2) opposite lines parallel; (3) lines connected at ends; (4) lines form 4 right angles.
• The fact that they are different colors and sizes and have different orientations is irrelevant. Color, size, and orientation are not defining features of the concept.
• If a stimulus is a member of a specified conceptual category, it is referred to as a “positive instance”. If it is not a member, it is referred to as a “negative instance”. These are all negative instances of the rectangle concept.
• As rectangles are defined, a stimulus is a negative instance if it lacks any one of the specified features.
• Every concept has two components:
  • Attributes: the features of a stimulus that one must look for to decide if that stimulus is a positive instance of the concept.
  • A rule: a statement that specifies which attributes must be present or absent for a stimulus to qualify as a positive instance of the concept.
• For rectangles, the attributes would be the four features discussed earlier, and the rule would be that all the attributes must be present.
• The simplest rules refer to the presence or absence of a single attribute. For example, a “vertebrate” animal is defined as an animal with a backbone. Which of these stimuli are positive instances?
• This rule is called affirmation. It says that a stimulus must possess a single specified attribute to qualify as a positive instance of a concept.
• The opposite or “complement” of affirmation is negation. To qualify as a positive instance, a stimulus must lack a single specified attribute.
• An invertebrate animal is one that lacks a backbone. These are the positive and negative instances when the negation rule is applied.
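The affirmation and negation rules can be sketched as predicates over a stimulus's attribute set; the animals and attribute names below are illustrative:

```python
# Affirmation: positive instance iff a single specified attribute is present.
# Negation:    positive instance iff that attribute is absent.

def affirmation(stimulus, attribute):
    return attribute in stimulus

def negation(stimulus, attribute):
    return attribute not in stimulus

fish = {"backbone", "fins"}
worm = {"segments"}
print(affirmation(fish, "backbone"))   # True: the fish is a vertebrate
print(negation(worm, "backbone"))      # True: the worm is an invertebrate
```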
Concept Learning
• In behavioral terms, when a concept is learned, two processes control how we respond to a stimulus:
• Generalization: We generalize a certain response (like the name of an object) to all members of the conceptual class based on their common attributes.
• Discrimination: We discriminate between stimuli which belong to the conceptual class and those that don’t because they lack one or more of the defining attributes.
Concept Learning: Behavioral Processes
For example, we generalize the word “rectangle” to those stimuli that possess the defining attributes...
...and discriminate between these stimuli and others that are outside the conceptual class, in which case we respond with a different word.
Perspectives and Issues
• Data is cheap and abundant; knowledge is expensive and scarce.
• Build a model that is a good and useful approximation to the data.
When are ML algorithms NOT needed?
❖When the relationships between all system variables (input, output, and hidden) are completely understood!
❖This is NOT the case for almost any real system!
What is needed?
• When solving a machine learning problem we must be sure to identify:
• What task is to be learned?
• How do we (will we) test the performance of our system?
• What knowledge do we want to learn?
• How do we represent this knowledge?
• What learning paradigm would be best to use?
• How do we construct a training experience for our learner?
Why Machine Learning
• Human expertise does not exist (navigating on Mars),
• Humans are unable to explain their expertise (speech recognition)
• Solution changes in time (routing on a computer network)
• Solution needs to be adapted to particular cases (user biometrics)
• Needs to identify hidden relationships and correlations within large amounts of data
• Human designers often produce machines that do not work as well as desired in the environments in which they are used.
❖ The amount of knowledge available about certain tasks might be too large for explicit encoding by humans (e.g., medical diagnostic).
❖ New knowledge about tasks is constantly being discovered by humans. It may be difficult to continuously re-design systems “by hand”.
Advantages of ML
➢ Alleviates the knowledge acquisition bottleneck
  • Does not require knowledge engineers
  • Scalable in constructing the knowledge base
➢ Adaptive
  • Adaptive to changing conditions
  • Easy migration to new domains
  • Customizes itself to individual users
➢ Discover new knowledge from large databases (data mining)
  • Market basket analysis (e.g., diapers and beer)
• Medical text mining (e.g. migraines to calcium channel blockers to magnesium)
➢ To Engineer better Computing Systems
➢Build a model that is a good and useful approximation to the data.
➢Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
Issues with Machine Learning
❖What algorithms are available for learning a concept? How well do they perform?
❖How much training data is sufficient to learn a concept with high confidence?
❖When is it useful to use prior knowledge?
❖Are some training examples more useful than others?
❖What are best tasks for a system to learn?
❖What is the best way for a system to represent its knowledge?
❖How can we optimize the accuracy on future data points?
❖Are some learning problems computationally tractable?
❖How can we formulate application problems as machine learning paradigms?
Paradigms of Machine Learning Algorithms
❖Learning algorithms fall into various paradigms with respect to the sort of feedback that the learner has access to:
  ✓Supervised Learning
  ✓Unsupervised Learning
  ✓Semi-Supervised Learning
  ✓Reinforcement Learning
Supervised
❖For every input, the learner is provided with a target; that is, the environment tells the learner what its response should be.
❖The learner then compares its actual response to the target and adjusts its internal memory in such a way that it is more likely to produce the appropriate response the next time it receives the same input.
❖We can think of learning a simple categorization task as supervised learning.
• Supervised learning: (a) presents a three-class labeled dataset, where the color shows the corresponding label of each sample. After supervised learning, the class-separating boundary could be found as the dotted lines in (b).
Classification Example
Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

A classifier (model) is learned from the training set and then used to predict the Cheat label for the test set.
• Support Vector Machines
• Logistic regression
• Linear discriminant analysis
• Decision trees
• k-nearest neighbor algorithm
• Neural networks (multilayer perceptron)
• Naive Bayes, etc.
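As a concrete toy example of one of the methods above, a 1-nearest-neighbor classifier can be sketched in a few lines; the feature vectors below are made up in the spirit of the Cheat table, not taken from real data:

```python
# 1-nearest-neighbor: predict the label of the closest training point.

def nn_classify(train, query):
    # train: list of (feature_vector, label) pairs
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda ex: sq_dist(ex[0], query))[1]

# Toy features: (taxable income in K, refund flag); labels: Cheat yes/no
train = [((125, 1), "No"), ((95, 0), "Yes"), ((60, 0), "No")]
print(nn_classify(train, (90, 0)))   # closest training point is (95, 0) -> "Yes"
```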
Unsupervised
❖The learner receives no feedback from the world at all.
❖Instead the learner's task is to re-represent the inputs in a more efficient way, as clusters or categories, or using a reduced set of dimensions.
❖Unsupervised learning is based on the similarities and differences among the input patterns. It does not result directly in differences in overt behavior because its "outputs" are really internal representations.
Clustering
• The goal of clustering is to:
  • group data points that are close (or similar) to each other
  • identify such groupings (or clusters) in an unsupervised manner, i.e., no information is provided to the algorithm on which data points belong to which clusters
Examples of Clustering
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
  • Tailor-made for each person: too expensive
  • One-size-fits-all: does not fit all
• Example 2: In marketing, segment customers according to their similarities, to do targeted marketing.
• Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.
• Spatial Data Analysis
• E. g., land use, city planning, earth-quake studies
• Neural network models (self-organizing map (SOM) and adaptive resonance theory (ART))
• Clustering (e.g., k-means, Gaussian mixture models, k-mode)
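The k-means idea can be sketched in one dimension, using made-up heights for the T-shirt sizing example: alternately assign each point to its nearest center and move each center to the mean of its cluster.

```python
# Minimal 1-D k-means: assign points to nearest center, recompute means.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

heights = [150, 152, 155, 180, 182, 185]   # two natural size groups
print(kmeans_1d(heights, [150, 190]))      # centers settle near the group means
```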
❖Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).
❖It makes use of both labeled and unlabeled data for training: typically a small amount of labeled data with a large amount of unlabeled data.
• Semi-supervised learning: (a) presents a labeled dataset (with red, green, and blue) together with an unlabeled dataset (marked with black). The distribution of the unlabeled dataset can guide the position of the separating boundary.
• Semi-Supervised Support Vector Machines (S3VMs)
• Laplacian Regularized Least Squares (LapRLS)
• Semi-Supervised Random Forests
Reinforcement learning
❖The learner receives feedback about the appropriateness of its response.
❖For correct responses it resembles supervised learning; however, the two forms of learning differ significantly for errors, situations in which the learner's behavior is in some way inappropriate.
❖In these situations, supervised learning lets the learner know exactly what it should have done, whereas reinforcement learning only says that the behavior was inappropriate and (usually) how inappropriate it was.
• Q-Learning
• Temporal Difference Learning
• Prioritized Sweeping
• Dynamic Bayesian Network-Markov Decision Process
Classification of ML Algorithms
❖Based on how data is available to learning algorithms, or the way learning happens:
  ✓Batch Learning
  ✓Online Learning
  ✓Instance-Based Learning
  ✓Incremental Learning
  ✓Deep Learning
  ✓Evolutionary Learning
  ✓Sequence Learning
Batch Learning or Offline Learning
• Machine learning algorithms assume they have access to the entire training dataset at once.
• In general, most machine learning algorithms fall into this category.
• SVM, Neural Networks (MLP) etc.
Online Learning
• Data arrives in a sequential fashion at a very high rate (i.e., as data streams), and it is not possible to store all of it, which forces real-time analysis
• Slightly different characteristics than time series data
• Examples of data streams include computer network traffic, web searches, and sensor data.
• VERY FAST DECISION TREE (VFDT)
• Concept-Adapting Very Fast Decision Tree (CVFDT)
• BIRCH
• STREAM
• CluStream
Sequence Learning
• Most machine learning algorithms are designed for independent, identically distributed (i.i.d.) data
• Sequence learning is the study of machine learning algorithms designed for sequential data. These algorithms should:
  • not assume data points to be independent, i.e., the data instances are strongly correlated
  • be able to deal with sequential distortions
  • make use of context information
• Applications include speech recognition, gesture recognition, protein secondary structure prediction, handwriting recognition.
• Algorithms: Hidden Markov Models (HMM), Conditional Random Fields (CRF), Maximum-Entropy Markov Models (MEMM)
Instance-based or Memory-based learning
• Instead of performing explicit generalization, these methods compare new problem instances with instances seen in training, which have been stored in memory.
• K-nearest neighbor, Neural Network (RBF), Locally Weighted Regression
Incremental Learning
• Capable of learning and updating with every new data point, labeled or unlabeled.
• Incremental SVM
• Incremental HMM
• Incremental Sigmoid Belief Networks (ISBNs)
• Incremental Decision Trees (ID5R)
Deep Learning
• Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.
• Deep Boltzmann Machine, Convolutional Deep Neural Networks, Deep Belief Networks
Displaying the structure of a set of documents using Latent Semantic Analysis (a form of PCA)
Each document is converted to a vector of word counts. This vector is then mapped to two coordinates and displayed as a colored dot. The colors represent the hand-labeled classes.
When the documents are laid out in 2-D, the classes are not used. So we can judge how good the algorithm is by seeing if the classes are separated.
Displaying the structure of a set of documents using a deep neural network
Regularization
• An extension made to the basic learning method that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.
• The most popular regularization algorithms are:
• Ridge Regression
• Least Absolute Shrinkage and Selection Operator (LASSO)
• Elastic Net
• Least-Angle Regression (LARS)
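The effect of a complexity penalty can be sketched with one-dimensional ridge regression (a toy special case, not a full implementation): ordinary least squares minimizes Σ(y − wx)², ridge adds a penalty λw², and the closed form becomes w = Σxy / (Σx² + λ), so a larger λ shrinks the weight toward zero.

```python
# 1-D ridge regression: w = sum(x*y) / (sum(x^2) + lam).
# lam = 0 recovers ordinary least squares; larger lam shrinks w toward 0.

def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
print(ridge_1d(xs, ys, 0.0))    # 1.0  (no penalty: fits y = x exactly)
print(ridge_1d(xs, ys, 14.0))   # 0.5  (penalty halves the weight here)
```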
Ensemble Algorithms
• Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.
• Boosting
• Bootstrapped Aggregation (Bagging)
• AdaBoost
• Stacked Generalization (blending)
• Gradient Boosting Machines (GBM)
• Gradient Boosted Regression Trees (GBRT)
• Random Forest
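The core idea (independently trained weak models whose predictions are combined) can be sketched as a majority vote; the three weak "classifiers" here are hand-made rules, purely for illustration:

```python
from collections import Counter

# Majority-vote ensemble: each weak classifier votes; most common label wins.

def majority_vote(classifiers, x):
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

weak = [
    lambda text: "spam" if "win" in text else "ham",
    lambda text: "spam" if "free" in text else "ham",
    lambda text: "spam" if text.count("!") > 2 else "ham",
]
print(majority_vote(weak, "win a free cruise"))   # two of three vote "spam"
```

Real ensemble methods differ in how the weak models are trained (e.g., bagging resamples the data, boosting reweights it) and how votes are weighted.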
Dimensionality Reduction Algorithms
• This can be useful to reduce the number of features, either to visualize high-dimensional data or to simplify the learning process
• Principal Component Analysis (PCA)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• Sammon Mapping
• Multidimensional Scaling (MDS)
• Projection Pursuit
• Linear Discriminant Analysis (LDA)
• Mixture Discriminant Analysis (MDA)
• Quadratic Discriminant Analysis (QDA)
• Flexible Discriminant Analysis (FDA)
Linear & Non-Linear Separability
Linear Separability
• Let X0 and X1 be two sets of points in an n-dimensional Euclidean space. Then X0 and X1 are linearly separable if there exist n + 1 real numbers w1, w2, …, wn, k such that every point x in X0 satisfies

  Σ_{i=1}^{n} w_i x_i > k

and every point x in X1 satisfies

  Σ_{i=1}^{n} w_i x_i < k,

where x_i is the ith component of x.
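The definition can be checked directly in code for a toy point set; the weights and threshold below are chosen by hand for illustration:

```python
# Check the linear-separability definition: every point in X0 must score
# above k, every point in X1 below k, for the given weights w.

def separates(w, k, X0, X1):
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    return all(score(x) > k for x in X0) and all(score(x) < k for x in X1)

X0 = [(2, 2), (3, 3)]     # e.g. the "blue" points
X1 = [(0, 0), (1, 0)]     # e.g. the "red" points
print(separates((1, 1), 1.5, X0, X1))   # True: the line x + y = 1.5 separates them
```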
• Linearly separable data: if all the data points can be correctly classified by a linear decision boundary (line).
These two sets are linearly separable because there exists a line in the plane with all of the blue points on one side of the line and all the red points on the other side. This idea immediately generalizes to higher-dimensional Euclidean spaces if the line is replaced by a hyperplane.
• If the data is not linearly separable:
  ✓Allow some errors
  ✓Still, try to place the hyperplane “far” from each class
Non-Linear Separability
Non Linear problem
Linear Separability
• Linear or non-linear separable data?
  • We can find out only empirically
• Linear algorithms (algorithms that find a linear decision boundary)
  • Used when we think the data is linearly separable
  • Advantages: simpler, fewer parameters
  • Disadvantages: high-dimensional data is usually not linearly separable
  • Examples: Perceptron, SVM
Nonlinear Separability
• It is well known that a nonlinear mapping from a low-dimensional space into a high-dimensional space can make linear classification feasible.
• In (a) a two-dimensional input space is depicted, in which the yellow spheres and the red stars cannot be separated with a single straight line.
• With a nonlinear mapping into a three-dimensional space, as depicted in (b), the spheres and stars can be separated by a single linear hyperplane.
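One common nonlinear map from 2-D to 3-D is φ(x, y) = (x², y², √2·xy). This sketch shows that the squared radius, which separates points inside a circle from those outside it, becomes a linear quantity in the mapped space:

```python
import math

# Feature map phi: 2-D point -> 3-D point (x^2, y^2, sqrt(2)*x*y).
# Inside/outside a circle (nonlinear in 2-D) becomes a plane z1 + z2 = r^2
# in the mapped 3-D space, i.e., linearly separable.

def phi(x, y):
    return (x * x, y * y, math.sqrt(2) * x * y)

inside = phi(0.5, 0.5)    # a point inside the unit circle
outside = phi(2.0, 0.0)   # a point outside the unit circle
# In 3-D, z1 + z2 recovers the squared radius x^2 + y^2:
print(inside[0] + inside[1], outside[0] + outside[1])   # 0.5 4.0
```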
Radial Basis Function (RBF) kernel in LIBSVM
Multiclass or multinomial classification
•Given: some data items that belong to one of M mutually-exclusive classes
•Task: Train the classifier and predict the class for a new data item
•Geometrically: harder problem, no more simple geometry
Multi-class classification
• For example, classifying a set of images of fruits which may be oranges, apples, or pears.
• Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
• The first category of algorithms includes decision trees, neural networks, k-Nearest Neighbor, and Naive Bayes classifiers.
• The basic SVM supports only binary classification, but extensions have been proposed to handle the multiclass classification case as well.
• The multiclass classification problem can be decomposed into several binary classification tasks that can be solved efficiently using binary classifiers.
• The most successful and widely used binary classifiers are the Support Vector Machines. The idea is similar to that of using codewords for each class and then using a number of binary classifiers to solve several binary classification problems, whose results determine the class label for new data.
One-versus-all (OVA)
• The simplest approach is to reduce the problem of classifying among K classes into K binary problems, where each problem discriminates a given class from the other K-1 classes
• When testing an unknown example, the classifier producing the maximum output is considered the winner, and its class label is assigned to that example
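One-versus-all can be sketched with K hand-made scoring functions; the linear scorers below are hypothetical, purely to show the argmax step:

```python
# One-versus-all: one scorer per class; the class with the maximum output wins.

def ova_predict(scorers, x):
    # scorers: dict mapping class label -> scoring function
    return max(scorers, key=lambda label: scorers[label](x))

scorers = {
    "apple":  lambda x: x[0] - x[1],
    "orange": lambda x: x[1] - x[0],
    "pear":   lambda x: -abs(x[0] - x[1]),
}
print(ova_predict(scorers, (3.0, 1.0)))   # "apple" scores 2.0, the maximum
```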
All-versus-all (AVA)
• In this approach, each class is compared to each other class. A binary classifier is built to discriminate between each pair of classes, while discarding the rest of the classes. This requires building K(K−1)/2 binary classifiers.
• When testing a new example, a voting is performed among the classifiers and the class with the maximum number of votes wins
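An AVA sketch under the same scheme (illustrative assumptions: a toy nearest-centroid rule stands in for each pairwise binary classifier, and the data is invented):

```python
from itertools import combinations

def centroid(pts):
    return [sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0]))]

def ava_train(X, y, classes):
    # one binary classifier per pair of classes: K(K-1)/2 in total;
    # each is trained only on that pair's examples (toy centroid rule)
    models = {}
    for a, b in combinations(classes, 2):
        models[(a, b)] = (centroid([x for x, l in zip(X, y) if l == a]),
                          centroid([x for x, l in zip(X, y) if l == b]))
    return models

def ava_predict(models, x, classes):
    votes = dict.fromkeys(classes, 0)
    d2 = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    for (a, b), (ca, cb) in models.items():
        votes[a if d2(x, ca) <= d2(x, cb) else b] += 1
    return max(votes, key=votes.get)  # class with the most pairwise wins

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [0, 6], [1, 6], [0, 5]]
y = ['a'] * 3 + ['b'] * 3 + ['c'] * 3
m = ava_train(X, y, ['a', 'b', 'c'])
print(len(m), [ava_predict(m, p, ['a', 'b', 'c'])
               for p in [[0.5, 0.5], [5.5, 5.5], [0.5, 5.5]]])
```

For K = 3 this builds 3 pairwise models; each new example is voted on by all of them.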
Error-Correcting Output-Codes
• Output-code strategies are quite different from one-vs-the-rest and one-vs-one. Each class is represented as a point in a Euclidean space where each dimension can only be 0 or 1; in other words, each class is represented by a binary code (an array of 0s and 1s).
• The matrix that keeps track of the location/code of each class is called the code book, and the code size is the dimensionality of that space.
• Intuitively, each class should be represented by a code that is as unique as possible, and a good code book should be designed to optimize classification accuracy.
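A sketch of output-code decoding (the code book, the toy centroid bit-classifiers, and the data are all illustrative assumptions): one binary classifier is trained per code bit, and a new example is assigned the class whose codeword is nearest in Hamming distance to the predicted bits.

```python
def centroid(pts):
    return [sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0]))]

def d2(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

# code book: each class gets a binary code; one binary classifier per bit
CODEBOOK = {'a': (0, 0, 1), 'b': (0, 1, 0), 'c': (1, 0, 0)}

def ecoc_train(X, y, codebook):
    # per bit, a toy centroid classifier: "classes whose bit is 1" vs. rest
    bits = len(next(iter(codebook.values())))
    models = []
    for i in range(bits):
        pos = [x for x, l in zip(X, y) if codebook[l][i] == 1]
        neg = [x for x, l in zip(X, y) if codebook[l][i] == 0]
        models.append((centroid(pos), centroid(neg)))
    return models

def ecoc_predict(models, x, codebook):
    pred = tuple(1 if d2(x, cp) <= d2(x, cn) else 0 for cp, cn in models)
    ham = lambda u, v: sum(a != b for a, b in zip(u, v))
    return min(codebook, key=lambda c: ham(codebook[c], pred))  # nearest code

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [0, 6], [1, 6], [0, 5]]
y = ['a'] * 3 + ['b'] * 3 + ['c'] * 3
models = ecoc_train(X, y, CODEBOOK)
print([ecoc_predict(models, p, CODEBOOK)
       for p in [[0.5, 0.5], [5.5, 5.5], [0.5, 5.5]]])
```

With longer, well-separated codewords the Hamming decoding can correct some bit-classifier errors, which is where the "error-correcting" name comes from.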
Multi-Label Classification
•Given: some data items that belong to more than one class of M possible classes
• Task: Train the classifier and predict the class for a new data item
•Geometrically: harder problem, no more simple geometry
Multi-label classification: Examples
• Language identification
• Text categorization (topics)
• For instance, an article in a newspaper may be assigned to the categories POLITICS, SPORTS, RELIGION, etc.
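The newspaper example can also be handled by the simplest problem transformation, binary relevance: one independent yes/no decision per category, so an article may receive several labels. A sketch (the keyword rules below are invented stand-ins for trained binary classifiers):

```python
def predict_topics(article, keyword_rules):
    # binary relevance: one independent yes/no decision per topic;
    # an article can match zero, one, or several topics
    words = set(article.lower().split())
    return sorted(t for t, kws in keyword_rules.items() if words & kws)

rules = {'politics': {'election', 'minister'},
         'sports': {'match', 'cricket'},
         'religion': {'temple', 'festival'}}
print(predict_topics("The minister attended the cricket match", rules))
# → ['politics', 'sports']
```

Unlike multiclass classification, nothing forces the predicted label set to have exactly one element.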
• Some classification algorithms/models have been adapted to the multi-label task, without requiring problem transformations. Examples of these include:
• boosting: AdaBoost.MH and AdaBoost.MR are extended versions of AdaBoost for multi-label data.
• k-nearest neighbors: the ML-kNN algorithm extends the k-NN classifier to multi-label data.
• decision trees: "Clare" is an adapted C4.5 algorithm for multi-label classification; the modification involves the entropy calculations.
• neural networks: BP-MLL is an adaptation of the popular back-propagation algorithm for multi-label learning.
• Various binary classifiers have been developed over time, and there is no clear winner as to which classifier performs best. Different classifiers perform differently depending on the number of observations, the dimensionality of the feature vector, the noise in the data, and various other factors. For example, random forests can outperform SVM classifiers on 3D point clouds.
General Assumptions on Dataset
•In machine learning, an unknown universal dataset is assumed to exist, which contains all the possible data pairs as well as their probability distribution of appearance in the real world.
• In real applications, however, what we observe is only a subset of the universal dataset, due to limited memory or other unavoidable reasons.
• This acquired dataset is called the training set (training data) and used to learn the properties and knowledge of the universal dataset.
• In general, the vectors in the training set are assumed to be independently and identically distributed (i.i.d.) samples from the universal dataset
Bias – Variance Tradeoff
• The bias-variance tradeoff is an important aspect of data science projects based on machine learning.
• Any learning algorithm uses a mathematical or statistical model whose “error” can be split into two main components: reducible and irreducible error.
• Irreducible error or inherent uncertainty is associated with a natural variability in a system. On the other hand, reducible error, as the name suggests, can be and should be minimized further to maximize accuracy.
• Suppose our outcome variable is Y and our covariates are X. We may assume there is a relationship relating one to the other, such as Y = f(X) + ε, where the error term ε is normally distributed with mean zero: ε ∼ N(0, σ_ε)
We may estimate a model f̂(x) of f(X) using linear regression or another modeling technique. In this case, the expected squared prediction error at a point x decomposes as:

Err(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ_ε²
       = Bias² + Variance + Irreducible error

• The third term, the irreducible error, is the noise term in the true relationship that cannot fundamentally be reduced by any model. Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and variance terms to 0.
• However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.
• Reducible error can be further decomposed into “error due to bias” and “error due to variance.” The data scientist’s goal is to simultaneously reduce bias and variance as much as possible, in order to obtain as accurate a model as is feasible.
• However, there is a tradeoff to be made when selecting models of different flexibility or complexity and in selecting appropriate training sets to minimize these sources of error!
• The bias is error from erroneous assumptions in the learning algorithm, or from model mismatch. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
• The variance is error from sensitivity to the particular training sample and to randomization. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outputs.
• The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and also generalizes well to unseen data.
• Unfortunately, it is typically impossible to do both simultaneously.
• High-variance learning methods may be able to represent their training set well, but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit, but may underfit their training data, failing to capture important regularities.
Error due to Bias
• The error due to bias is taken as the difference between the expected (or average obtained from cross-validation) prediction of our model and the correct value which we are trying to predict.
• Bias measures how far off, in general, these models' predictions are from the correct value. If the average predictions are substantially different from the true value, bias will be high.
Error due to Variance
• The error due to variance is the amount by which the prediction made from one particular training set differs from the expected prediction averaged over all possible training sets.
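These two definitions can be made concrete with a small simulation (illustrative assumptions: a quadratic truth, Gaussian noise, and two extreme models, a constant "mean" predictor and a 1-nearest-neighbor predictor). Averaging predictions over many resampled training sets estimates bias² and variance at one query point:

```python
import random
import statistics

random.seed(0)
f = lambda x: x * x                                   # true function
def draw_training_set(n=20, sigma=0.5):
    xs = [random.uniform(0, 1) for _ in range(n)]
    return [(x, f(x) + random.gauss(0, sigma)) for x in xs]

x0 = 0.9                                              # query point
mean_preds, nn_preds = [], []
for _ in range(2000):                                 # many training sets
    D = draw_training_set()
    mean_preds.append(statistics.fmean(t for _, t in D))      # rigid model
    nn_preds.append(min(D, key=lambda p: abs(p[0] - x0))[1])  # flexible model

for name, preds in [("mean model", mean_preds), ("1-NN model", nn_preds)]:
    bias2 = (statistics.fmean(preds) - f(x0)) ** 2    # (avg prediction - truth)^2
    var = statistics.pvariance(preds)                 # spread across training sets
    print(f"{name}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

The rigid mean model shows large bias² and small variance; the 1-NN model shows the reverse, which is exactly the tradeoff described above.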
Graphical Visualization of bias and variance using a bulls-eye diagram
• If your target truth is nonlinear and you select a linear model to approximate it, you introduce a bias resulting from the linear model’s inability to be nonlinear where it needs to be: the linear model underfits the nonlinear target function over the training set. Likewise, if your target truth is linear and you select a nonlinear model to approximate it, you introduce a bias resulting from the nonlinear model’s inability to be linear where it needs to be: the nonlinear model overfits the linear target function over the training set.
bias vs. variance
• high bias, low variance
• medium bias, medium variance
• low bias, high variance
Bias-variance tradeoff
[Figure: training error and test error vs. model complexity. Training error decreases monotonically with complexity; test error is U-shaped. The underfitting regime (high bias, low variance) lies at low complexity, the overfitting regime (low bias, high variance) at high complexity.]
Bias-variance tradeoff
[Figure: test error vs. model complexity for many vs. few training examples. With few examples, the high-variance overfitting regime is reached at lower complexity; with many examples, higher-complexity (lower-bias) models become viable.]
Effect of Training Size
Effect of Model Complexity
Computational Learning Theory
Learning Theory
•COLT helps to define the class of learnable concepts in terms of (i) computational complexity, i.e. the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts, and (ii) sample complexity, i.e. the number of training examples necessary or sufficient to learn hypotheses of a given accuracy.
•A good hypothesis is a productive one. A productive hypothesis can:
• Be easily learned and applied
• Explain the past accurately and persuasively
• Make accurate predictions about the future
• Generate new, even more useful hypotheses
• Be applied to a wide variety of situations
• Be easily tested
• Learning in the limit: Is the learner guaranteed to converge to the correct hypothesis in the limit as the number of training examples increases indefinitely?
• Sample Complexity: Can one characterize the number of training examples necessary/sufficient for highly accurate learning?
• Computational Complexity: How much computational resources (time and space) are needed for a learner to learn a highly accurate hypothesis?
• Is it possible to identify classes of concepts that are inherently difficult/easy to learn, independent of the learning algorithm?
• Mistake Bound: how many training examples will the learner misclassify before converging on a highly accurate concept?
Two frameworks for analyzing learning algorithms
• Probably Approximately Correct (PAC) framework
  • Identify classes of hypotheses that can/cannot be learned from a polynomial number of training samples
    • Finite hypothesis spaces
    • Infinite hypothesis spaces (VC dimension)
  • Define a natural measure of complexity for hypothesis spaces (the VC dimension) that allows bounding the number of training examples required for inductive learning
• Mistake bound framework
  • Number of training errors made by a learner before it determines the correct hypothesis
PAC Learning
• PAC Model
  • Only requires learning a Probably Approximately Correct concept: learn a decent approximation most of the time.
  • Requires polynomial sample complexity and computational complexity.
• PAC-learnability is mostly determined by the number of training examples required by the learner. The sample complexity of a learning problem is the growth in the number of required training examples with problem size.
• This is because in practical settings the most limiting factor of the learner is the number of training examples available.
• Suppose we try to characterize the number of training examples needed to learn a hypothesis h for which error = 0. Unfortunately, it turns out this is futile in the setting we are considering, for two reasons.
• First, unless we provide training examples corresponding to every possible instance in X (an unrealistic assumption), there may be multiple hypotheses consistent with the provided training examples, and the learner cannot be certain to pick the one corresponding to the target concept.
• Second, given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by the learner will be misleading.
• To accommodate these two difficulties, we weaken our demands on the learner in two ways.
• First, we will not require that the learner output a zero-error hypothesis; we will require only that its error be bounded by some constant ε that can be made arbitrarily small. Second, we will not require that the learner succeed for every sequence of randomly drawn training examples; we will require only that its probability of failure be bounded by some constant δ that can be made arbitrarily small.
• The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept.
• In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) the system learns a concept with error at most ε.
Chernoff‐Hoeffding bound
• If you average a bunch of bounded random variables, then the probability this average random variable deviates from its expectation is exponentially small in the amount of deviation.
• Let X₁, X₂, ..., X_m be independent random variables whose values are in the range [0, 1]. Let

X = Σᵢ Xᵢ,   μᵢ = E[Xᵢ],   μ = E[X] = Σᵢ μᵢ

Then for all ε > 0,

Pr(|X − μ| > ε) ≤ 2e^(−2ε²/m)
• One nice thing about the Chernoff bound is that it doesn’t matter how the variables are distributed.
• This is important because in PAC we need guarantees that hold for any distribution generating data.
• Indeed, in this case the random variables above will be individual examples drawn from the distribution generating the data.
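A quick empirical check of the bound (a toy assumption: the Xᵢ are fair coin flips, so the sum X is Binomial): the observed frequency of large deviations should stay below the Chernoff-Hoeffding bound.

```python
import math
import random

random.seed(1)
m, trials = 200, 5000
eps = 20                                  # deviation of the SUM of m coin flips
bound = 2 * math.exp(-2 * eps**2 / m)     # Chernoff-Hoeffding bound

mu = m * 0.5                              # E[X] for m fair {0,1} coin flips
exceed = 0
for _ in range(trials):
    X = sum(random.random() < 0.5 for _ in range(m))
    if abs(X - mu) > eps:
        exceed += 1
print(f"empirical Pr = {exceed / trials:.4f}  <=  bound = {bound:.4f}")
```

The bound here is about 0.037, while the empirical deviation frequency is far smaller; the bound is loose, but it holds for any distribution on [0, 1], which is what PAC analysis needs.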
• We’ll be estimating the probability that our hypothesis has error deviating by more than ε, and we want to bound this probability by δ, as in the definition of PAC-learning.
• Since the amount of deviation (error ε) and the number of samples (m ) both occur in the exponent, the trick is in balancing the two values to get what we want.
• An algorithm that efficiently finds a consistent hypothesis will PAC-learn any finite concept class provided it has at least m samples, where

m ≥ (1/ε) (ln|H| + ln(1/δ))
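The bound is easy to evaluate numerically. A sketch (the hypothesis class in the example, conjunctions over 10 boolean features with |H| = 3¹⁰, is an assumed illustration):

```python
import math

def pac_sample_bound(H_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# hypothetical example: conjunctions over 10 boolean features, |H| = 3^10
print(pac_sample_bound(3 ** 10, eps=0.1, delta=0.05))   # → 140
```

Note the pleasant scaling: m grows only logarithmically in |H| and in 1/δ, but linearly in 1/ε.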
Inductive Learning Hypothesis
•Any hypothesis found to approximate the target function well over the training examples, will also approximate the target function well over the unobserved examples.
• find the hypothesis that best fits the training data
• Version space: contains all plausible versions of the target concept
• A hypothesis h is consistent with training examples D iff h(x) = c(x) for each example <x, c(x)> in D
• The version space with respect to hypothesis space H and training examples D is the subset of hypotheses from H consistent with the training examples in D
• The VC dimension (Vapnik–Chervonenkis dimension) is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.
The probabilistic guarantee

E_test ≤ E_train + √[ (h (log(2N/h) + 1) − log(p/4)) / N ]

where N = size of training set
h = VC dimension of the model class = complexity
p = upper bound on the probability that this bound fails

So if we train models with different complexity, we should pick the one that minimizes this bound.

Actually, this is only sensible if we think the bound is fairly tight, which it usually isn't. The theory provides insight, but in practice we still need some witchcraft.
A simple example: Fitting a polynomial
• The green curve is the true function (which is not a polynomial)
• The data points are uniform in x but have noise in y.
• We will use a loss function that measures the squared error in the prediction of y(x) from x. The loss for the red polynomial is the sum of the squared vertical errors.
from Bishop
Some fits to the data: which is best?
from Bishop
A simple way to reduce model complexity
• If we penalize polynomials that have big values for their coefficients, we will get less wiggly solutions:
Ẽ(w) = (1/2) Σₙ₌₁ᴺ { y(xₙ, w) − tₙ }² + (λ/2) ||w||²

where λ is the regularization parameter, tₙ is the target value, and Ẽ(w) is the penalized loss function.
from Bishop
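The penalized loss above can be evaluated directly. A sketch (the data points and the two hand-picked polynomials are illustrative assumptions): both polynomials fit the training points exactly, but the wiggly one pays a much larger coefficient penalty.

```python
def penalized_loss(w, data, lam):
    # E~(w) = 1/2 * sum_n { y(x_n, w) - t_n }^2  +  (lam/2) * ||w||^2
    y = lambda x: sum(wj * x ** j for j, wj in enumerate(w))
    data_term = 0.5 * sum((y(x) - t) ** 2 for x, t in data)
    return data_term + 0.5 * lam * sum(wj * wj for wj in w)

# two polynomials that BOTH pass exactly through (0,0), (0.5,0.5), (1,1)
data = [(0, 0), (0.5, 0.5), (1, 1)]
w_smooth = [0, 1]            # y = x
w_wiggly = [0, 2, -3, 2]     # y = 2x - 3x^2 + 2x^3, same values at the 3 points
print(penalized_loss(w_smooth, data, lam=1.0))   # 0.5 * ||w||^2 = 0.5
print(penalized_loss(w_wiggly, data, lam=1.0))   # 0.5 * 17   = 8.5
```

With λ = 0 the two fits are indistinguishable (both have zero training error); the penalty term is what breaks the tie in favor of the less wiggly solution.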
[Figure: regularized fits compared, with the table of polynomial coefficients for each setting of λ]
Generalization
• The real aim of Machine Learning is to do well on test data that is not known during learning.
Generalization
• Choosing the values for the parameters that minimize the loss function on the training data is not necessarily the best policy.
• We want the learning machine to model the true regularities in the data and to ignore the noise in the data.
  • But the learning machine does not know which regularities are real and which are accidental quirks of the particular set of training examples we happen to pick.
• So how can we be sure that the machine will generalize correctly to new data?
Generalization
• One can only say a model generalizes well if it explains the data surprisingly well given the complexity of the model.
• If the model has as many degrees of freedom as the data, it can fit the data perfectly but so what?
• There is a lot of theory about how to measure the model complexity and how to control it to optimize generalization.
Generative vs. Discriminative
Generative vs. Discriminative
• Generative models: specify a joint probability distribution over observation and label sequences, i.e. a full probabilistic model of all variables
  • Model class-conditional pdfs and prior probabilities
• Discriminative models: provide a model only for the target variable(s) conditional on the observed variables
  • Directly estimate posterior probabilities
  • No attempt to model the underlying probability distributions
Generative vs. Discriminative
[Figure: a discriminative model learns a decision boundary; a generative model learns the class-conditional densities]
Generative Methods
• ☺ Relatively straightforward to characterize invariances
• ☺ They can handle partially labelled data
• They wastefully model variability which is unimportant for classification
• They scale badly with the number of classes and the number of invariant transformations
• Slow on test data
• higher asymptotic error
Discriminative Methods
• ☺ They can be very fast once trained
• ☺ lower asymptotic error
• inherently supervised, cannot deal with unlabelled data
• They interpolate between training examples, and hence can fail if novel inputs are presented
• They don’t easily handle compositionality
Generative vs. Discriminative

Generative
• Naïve Bayes
• Mixtures of Gaussians
• Hidden Markov Models (HMM)
• Bayesian networks
• Markov random fields

Discriminative
• Logistic regression
• SVMs
• Neural networks (MLP, RBF)
• Nearest neighbor
• Conditional Random Fields (CRF)
Parametric vs non-parametric models
• Parametric models assume some finite set of parameters θ.
• Given the parameters, future predictions, x, are independent of the observed data, D:
P(x|θ, D) = P(x|θ)
• Therefore θ captures everything there is to know about the data.
• So the complexity of the model is bounded even if the amount of data is unbounded. This makes parametric models not very flexible.
• “Non-parametric” does not mean “no parameters”: rather, the number and nature of the parameters are determined by the training data, not fixed in advance by the model.
• Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite dimensional θ. Usually we think of θ as a function.
• The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes them more flexible.
• Monte Carlo Hidden Markov Models: non-parametric hidden Markov models with continuous state and observation spaces.
• Based on the Dirichlet Process, a nonparametric Bayesian Hidden Markov Model is proposed, which allows an infinite number of hidden states and uses an infinite number of Gaussian components to support continuous observations
Parameters to learn
• Two kinds of parameters
• One kind the user sets in advance for the training procedure: hyperparameters
  • the degree of the polynomial to fit in regression
  • the number/size of hidden layers in a neural network
  • the number of instances per leaf in a decision tree
• One kind that actually gets optimized during training: parameters
  • regression coefficients
  • network weights
  • size/depth of the decision tree
• We usually do not talk about the latter, but refer to hyperparameters as parameters
•Machine Learning algorithms rely on three components:
✓ Representation
✓ Optimization
✓ Evaluation
Model Selection
• The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The chosen model is the one that minimizes the Kullback-Leibler distance between the model and the truth. It’s based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined as:
AIC = -2 ( ln ( likelihood )) + 2 K
where likelihood is the probability of the data given a model and K is the number of free parameters in the model.
• The second order information criterion, often called AICc, takes into account sample size by, essentially, increasing the relative penalty for model complexity with small data sets. It is defined as:
AICc = -2 ( ln ( likelihood )) + 2 K * (n / ( n – K – 1))
where n is the sample size. As n gets larger, AICc converges to AIC (n − K − 1 → n as n gets much bigger than K, and so n / (n − K − 1) approaches 1), so there's really no harm in always using AICc regardless of sample size.
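Both criteria are one-line formulas. A sketch using the definitions above (the two candidate models, their log-likelihoods, and n are invented illustration values):

```python
def aic(log_lik, k):
    # AIC = -2 ln(likelihood) + 2K
    return -2 * log_lik + 2 * k

def aicc(log_lik, k, n):
    # AICc = -2 ln(likelihood) + 2K * (n / (n - K - 1))
    return -2 * log_lik + 2 * k * (n / (n - k - 1))

# hypothetical fits: (log-likelihood, K free parameters), n = 30 observations
models = {"simple (K=3)": (-120.0, 3), "complex (K=8)": (-118.5, 8)}
n = 30
for name, (ll, k) in models.items():
    print(f"{name}: AIC = {aic(ll, k):.2f}, AICc = {aicc(ll, k, n):.2f}")
print("chosen:", min(models, key=lambda m: aicc(*models[m], n)))
```

Here the complex model's slightly better likelihood does not pay for its extra parameters at this small n, so the simple model is chosen.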
Minimum Description Length
• MDL is an information-theoretic approach to machine learning, or statistical model selection
• Basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself.
• Provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution.
• Hence you really want to minimize the combined length of the description of the model, plus the description of the data under that model.
• The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally
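A toy two-part-code sketch of this idea (all numbers are illustrative assumptions: the model costs are invented, and the data is a deliberately regular bit sequence): the total description length is the bits needed to state the model plus the bits needed to encode the data under that model.

```python
import math

def description_length(data, p, model_bits):
    # two-part code: bits to state the model + bits to encode the data
    # under a Bernoulli(p) model (Shannon code length: -log2 probability)
    n1 = sum(data)
    n0 = len(data) - n1
    data_bits = -(n1 * math.log2(p) + n0 * math.log2(1 - p))
    return model_bits + data_bits

bits = [1] * 18 + [0] * 2          # a regular (highly compressible) sequence
# "no regularity" model: p = 0.5, assume ~0 bits to state it
plain = description_length(bits, 0.5, model_bits=0)
# fitted model: p = 0.9, assume 8 bits to transmit the parameter
fitted = description_length(bits, 0.9, model_bits=8)
print(f"plain: {plain:.1f} bits, fitted: {fitted:.1f} bits")
```

The fitted model wins despite its own description cost, because the regularity in the data lets it encode the sequence in fewer bits; on truly random data the plain model would win.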
[Figure: a learner induces a classifier that labels new examples Positive (Yes) or Negative (No); exact concepts cannot be learned from limited data, only approximations.]
•A measure of the “power” or the “complexity” of the hypothesis space
• Higher VC dimension implies a more “expressive” hypothesis space
•Shattering: a set of N points is shattered if, for every possible classification of the N points, there exists a hypothesis consistent with that classification
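Shattering can be checked exhaustively for small hypothesis classes. A sketch (the class of 1-D thresholds, with both orientations, is an assumed illustration; its VC dimension is 2):

```python
from itertools import product

def realizable(points, labels):
    # hypothesis class: thresholds h(x) = [x > t], plus the flipped version;
    # only one candidate cut between each adjacent pair of points matters
    pts = sorted(points)
    cuts = ([pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
            + [pts[-1] + 1])
    for t in cuts:
        for flip in (False, True):
            if tuple((x > t) != flip for x in points) == labels:
                return True
    return False

def shatters(points):
    # shattered: every one of the 2^N labelings is realized by some hypothesis
    return all(realizable(points, labs)
               for labs in product((False, True), repeat=len(points)))

print(shatters([0.0, 1.0]))        # thresholds shatter any 2 distinct points
print(shatters([0.0, 1.0, 2.0]))   # but not 3: (+, -, +) is unrealizable
```

Since some set of 2 points is shattered but no set of 3 is, the VC dimension of this threshold class is exactly 2.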
VC Dimension
No free lunch theorem
• All models are wrong, but some models are useful. —George Box
• Much of machine learning is concerned with devising different models to fit the data.
• We can use methods such as cross validation to empirically choose the best model for a particular problem.
• However, there is no universally best model — this is sometimes called the no free lunch theorem (Wolpert 1996).
• The no free lunch idea, illustrated with datasets: (a) is the training set we have, and (b), (c) are two test sets. Since (c) has a different sample distribution from (a) and (b), we cannot expect the properties learned from (a) to be useful on (c).
The End