Machine Learning Dr. G. Bharadwaja Kumar VIT Chennai


Page 1:

Machine Learning

Dr. G. Bharadwaja Kumar

VIT Chennai

Page 2:

Important developments in Computer Science

1. Biometrics: speaker verification, face, iris, fingerprint

2. Finance: Credit scoring, fraud detection

3. Manufacturing: Optimization, troubleshooting

4. Medicine: Clinical diagnosis

5. Telecommunications: Quality of service optimization

6. Stock market forecasting

7. Handwritten character recognition

8. Autonomous robot control

9. Spam email detection

10. ...

Page 3:

Page 4:

Watson - Jeopardy

Page 5:

Page 6:

Page 7:

Speech Recognition

Apple Smart Watch with Siri; Apple Smart Phone with Siri

Page 8:

Page 9:

Medical Diagnosis

• Assist in decision making with a large number of inputs and in stressful situations

Page 10:

Medical Diagnosis

Page 11:

Page 12:


ML Application: Loan Approvals

name          income    debt     married  age   decision (approve / deny)
John Smith    200,000   0        yes      80
Peter White   60,000    1,000    no       30
Ann Clark     100,000   10,000   yes      40
Susan Ho      0         20,000   no       25

• Objects – people

• Classes – “approve”, “deny”

Page 13:

Biometrics

Page 14:

Page 15:

Page 16:

Page 17:

What is Artificial Intelligence?

• Branch of computer science which
  • is the science of making machines do things that would require intelligence if done by men (Minsky)

  • is the exciting new effort to make computers think (Haugeland)

  • is the study of the computations that make it possible to perceive, reason, and act (Winston)

  • is the study of how to do things which at the moment people do better (Rich & Knight)

• The term "Artificial Intelligence" was coined in 1956 by John McCarthy at the Dartmouth Conference

Page 18:

• AI is an extensive field of Computer Science

• There are many sub-fields of AI:
  • Machine Learning

• Natural Language processing

• Speech Recognition

• Computer Vision

Page 19:

Machine Learning: Definition

• "The goal of machine learning is to build computer systems that can adapt and learn from example data and past experience, and optimize their performance."

  – Tom Dietterich

Page 20:

Other Definitions

❖ Machine learning enables computers to learn and improve automatically using example data or past experience, and to handle new situations.

❖ The field of Machine Learning seeks to answer the question "How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?"

Page 21:

What is Machine Learning?

The complexity in traditional computer programming is in the code (programs that people write). In machine learning, algorithms (programs) are in principle simple and the complexity (structure) is in the data. Is there a way that we can automatically learn that structure? That is what is at the heart of machine learning.

-- Andrew Ng

That is, machine learning is about the construction and study of systems that can learn from data. This is very different from traditional computer programming.

Page 22:

Page 23:

What is learning?

• Abstracting & generalizing knowledge or patterns from the data

• Required components:
  • Identifying the exact type of knowledge to be learned
  • A representation for this target knowledge
  • A learning mechanism

Page 24:

Why Machine Learning is Hard

What you see vs. what your ML algorithm sees

Page 25:

Well Posed Learning Problems

• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
  • T: the class of tasks that we want the computer program to do
  • P: the measure of performance for how well the computer did
  • E: some experience (training data) the program has with the task

Page 26:

A checkers learning problem

• A checkers learning problem
  • T: playing checkers

  • P: percent of games won against opponents

  • E: playing practice games against itself

Page 27:

Designing a Learning System

• Consider designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament

• Requires the following design choices:
  • Choosing the Training Experience

• Choosing the Target Function

• Choosing the Representation of the Target Function

• Choosing the Function Approximation Algorithm

Page 28:

Choosing the Training Experience (1)

• Will the training experience provide direct or indirect feedback?
  • Direct feedback: the system learns from examples of individual checkers board states and the correct move for each
  • Indirect feedback: move sequences and final outcomes of various games played
    • Credit assignment problem: the value of early states must be inferred from the outcome

• Degree to which the learner controls the sequence of training examples
  • Teacher selects informative boards and gives the correct move
  • Learner proposes board states that it finds particularly confusing; teacher provides the correct moves
  • Learner controls the board states and (indirect) training classifications

Page 29:

Choosing the Training Experience (2)

• How well the training experience represents the distribution of examples over which the final system performance P will be measured
  • If training of the checkers program consists only of games played against itself, it may never encounter crucial board states that are likely to be played by the human checkers champion

• Most theory of machine learning rests on the assumption that the distribution of training examples is identical to the distribution of test examples

Page 30:

Partial Design of Checkers Learning Program

• A checkers learning problem:
  • Task T: playing checkers

  • Performance measure P: percent of games won in the world tournament

  • Training experience E: games played against itself

• Remaining choices:
  • The exact type of knowledge to be learned

  • A representation for this target knowledge

  • A learning mechanism

Page 31:

Choosing the Target Function (1)

• Assume that you can determine legal moves

• Program needs to learn the best move from among the legal moves
  • Defines a large search space known a priori
  • Target function: ChooseMove : B → M
  • ChooseMove is difficult to learn given indirect training

• Alternative target function
  • An evaluation function that assigns a numerical score to any given board state
  • V : B → ℝ (where ℝ is the set of real numbers)

• V(b) for an arbitrary board state b in B
  • if b is a final board state that is won, then V(b) = 100
  • if b is a final board state that is lost, then V(b) = -100
  • if b is a final board state that is drawn, then V(b) = 0
  • if b is not a final state, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game

Page 32:

Choosing the Target Function (2)

• V(b) gives a recursive definition for board state b
  • Not usable because it is not efficient to compute, except in the first three trivial cases
  • A nonoperational definition

• The goal of learning is to discover an operational description of V

• Learning the target function is often called function approximation
  • The learned function is referred to as V̂

Page 33:

Choosing a Representation for the Target Function

• Choice of representation involves trade-offs

• Pick a very expressive representation to allow a close approximation to the ideal target function V
  • The more expressive the representation, the more training data is required to choose among alternative hypotheses

• Use a linear combination of the following board features:
  • x1: the number of black pieces on the board
  • x2: the number of red pieces on the board
  • x3: the number of black kings on the board
  • x4: the number of red kings on the board
  • x5: the number of black pieces threatened by red (i.e. which can be captured on red's next turn)
  • x6: the number of red pieces threatened by black

V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
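A minimal Python sketch of this linear evaluation function follows. The board encoding and the feature-extraction stub are assumptions made for illustration; only the weighted sum itself comes from the slide.

    # Linear evaluation function V̂(b) = w0 + w1·x1 + ... + w6·x6 (sketch).
    def board_features(board):
        # Hypothetical board encoding: a dict that already holds the six counts x1..x6.
        return [board["black_pieces"], board["red_pieces"],
                board["black_kings"], board["red_kings"],
                board["black_threatened"], board["red_threatened"]]

    def v_hat(board, w):
        # w = [w0, w1, ..., w6]
        x = board_features(board)
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))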

Page 34:

Partial Design of Checkers Learning Program

• A checkers learning problem:
  • Task T: playing checkers

  • Performance measure P: percent of games won in the world tournament

  • Training experience E: games played against itself

  • Target function: V : Board → ℝ

  • Target function representation:

    V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6

Page 35:

Choosing a Function Approximation Algorithm

• To learn V̂ we require a set of training examples, each describing a board state b together with its training value Vtrain(b)
  • Ordered pair ⟨b, Vtrain(b)⟩

  • Example: ⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩

Page 36:

Estimating Training Values

• Need to assign specific scores to intermediate board states

• Approximate intermediate board state b using the learner's current approximation of the next board state following b

• Simple and successful approach

• More accurate for states closer to end states

Vtrain(b) ← V̂(Successor(b))

Page 37:

Adjusting the Weights

• Choose the weights wi to best fit the set of training examples

• Minimize the squared error E between the train values and the values predicted by the hypothesis

• Require an algorithm that
  • will incrementally refine the weights as new training examples become available

• will be robust to errors in these estimated training values

• Least Mean Squares (LMS) is one such algorithm

E ≡ Σ over all training examples ⟨b, Vtrain(b)⟩ of (Vtrain(b) − V̂(b))²

Page 38:

LMS Weight Update Rule

• For each training example ⟨b, Vtrain(b)⟩:
  • Use the current weights to calculate V̂(b)
  • For each weight wi, update it as

    wi ← wi + η (Vtrain(b) − V̂(b)) xi

  • where η is a small constant (e.g. 0.1)
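A small sketch of this LMS update, reusing the hypothetical board_features and v_hat helpers from the earlier sketch; the training examples and the value of η are illustrative.

    # LMS rule: wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi   (x0 = 1 for the bias weight w0)
    def lms_update(w, board, v_train, eta=0.1):
        x = [1.0] + board_features(board)
        error = v_train - v_hat(board, w)
        return [wi + eta * error * xi for wi, xi in zip(w, x)]

    # Usage: repeatedly refine the weights from (board, Vtrain(board)) pairs.
    # for board, v_train in training_examples:
    #     w = lms_update(w, board, v_train)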

Page 39:

Final Design

The final design combines four modules in a loop:

• Experiment Generator: takes the current hypothesis and proposes a new problem (an initial game board)

• Performance System: plays the game, producing a solution trace (the game history)

• Critic: turns the solution trace into training examples {⟨b1, Vtrain(b1)⟩, ⟨b2, Vtrain(b2)⟩, ...}

• Generalizer: learns from the training examples and outputs the hypothesis (V̂), which is fed back to the Experiment Generator

Page 40:

Summary of Design Choices

• Determine type of training experience: games against itself, table of correct moves, games against experts, ...

• Determine target function: Board → value, Board → move, ...

• Determine representation of learned function: linear function of six features, polynomial, artificial neural network, ...

• Determine learning algorithm: gradient descent, linear programming, ...

• Result: complete design

Page 41:

Training Classification Problems

• Many learning problems involve classifying inputs into a discrete set of possible categories.

• Learning is only possible if there is a relationship between the data and the classifications.

• Training involves providing the system with data which has been manually classified.

• Learning systems use the training data to learn to classify unseen data.

Page 42:

Rote learning

• A very simple learning method.

• Simply involves memorizing the classifications of the training data.

• Can only classify previously seen data – unseen data cannot be classified by a rote learner.

Page 43:

• Rote learning is learning without understanding the meaning of what is learned.

• For example, you can learn to make the correct response to a stimulus without discovering the conceptual category to which the stimulus belongs.

• More technically, you make the correct response without detecting the attributes that the stimulus shares with other members of the conceptual class.

Page 44:

• The next time you see that example, you may give the correct response.

• But what if you are given an example you haven’t seen before?

Page 45:

Concept Learning

Page 46:

Concept Learning

• Concept learning involves determining a mapping from a set of input variables to a Boolean value.

• Such methods are known as inductive learning methods.

• If a function can be found which maps training data to correct classifications, then it will also work well for unseen data – hopefully!

• This process is known as generalization.

Page 47:

Concept Learning

• Concepts are categories of stimuli that have certain features in common.

• The shapes on the right are all members of a conceptual category: rectangle. Their common features are (1) 4 lines; (2) opposite lines parallel; (3) lines connected at ends; (4) lines form 4 right angles.

• The fact that they are different colors and sizes and have different orientations is irrelevant. Color, size, and orientation are not defining features of the concept.

Page 48:

• If a stimulus is a member of a specified conceptual category, it is referred to as a "positive instance". If it is not a member, it is referred to as a "negative instance". These are all negative instances of the rectangle concept:

• As rectangles are defined, a stimulus is a negative instance if it lacks any one of the specified features.

Page 49:

• Every concept has two components:

  • Attributes: These are features of a stimulus that one must look for to decide if that stimulus is a positive instance of the concept.

  • A rule: This is a statement that specifies which attributes must be present or absent for a stimulus to qualify as a positive instance of the concept.

• For rectangles, the attributes would be the four features discussed earlier, and the rule would be that all the attributes must be present.

Page 50:

• The simplest rules refer to the presence or absence of a single attribute. For example, a “vertebrate” animal is defined as an animal with a backbone. Which of these stimuli are positive instances?

• This rule is called affirmation. It says that a stimulus must possess a single specified attribute to qualify as a positive instance of a concept.

Page 51:

• The opposite or "complement" of affirmation is negation. To qualify as a positive instance, a stimulus must lack a single specified attribute.

• An invertebrate animal is one that lacks a backbone. These are the positive and negative instances when the negation rule is applied.

Page 52:

Concept Learning

• In behavioral terms, when a concept is learned, two processes control how we respond to a stimulus:

• Generalization: We generalize a certain response (like the name of an object) to all members of the conceptual class based on their common attributes.

• Discrimination: We discriminate between stimuli which belong to the conceptual class and those that don’t because they lack one or more of the defining attributes.

Page 53:

Concept Learning: Behavioral Processes

For example, we generalize the word “rectangle” to those stimuli that possess the defining attributes...

...and discriminate between these stimuli and others that are outside the conceptual class, in which case we respond with a different word:

Page 54:

Perspectives and Issues

Page 55:

• Data is cheap and abundant; knowledge is expensive and scarce.

• Build a model that is a good and useful approximation to the data.

Page 56:

When are ML algorithms NOT needed?

❖ When the relationships between all system variables (input, output, and hidden) are completely understood!

❖This is NOT the case for almost any real system!

Page 57:

What is needed?

• When solving a machine learning problem we must be sure to identify:

• What task is to be learned?

• How do we (will we) test the performance of our system?

• What knowledge do we want to learn?

• How do we represent this knowledge?

• What learning paradigm would be best to use?

• How do we construct a training experience for our learner?

Page 58:

Why Machine Learning

• Human expertise does not exist (navigating on Mars),

• Humans are unable to explain their expertise (speech recognition)

• Solution changes in time (routing on a computer network)

• Solution needs to be adapted to particular cases (user biometrics)

• Needs to identify hidden relationships and correlations within large amounts of data

• Human designers often produce machines that do not work as desired in the environments in which they are used.

Page 59:

❖ The amount of knowledge available about certain tasks might be too large for explicit encoding by humans (e.g., medical diagnosis).

❖ New knowledge about tasks is constantly being discovered by humans. It may be difficult to continuously re-design systems “by hand”.

Page 60:

Advantages of ML

➢ Alleviates the knowledge acquisition bottleneck
  • Does not require knowledge engineers

  • Scalable in constructing the knowledge base

➢ Adaptive
  • Adapts to changing conditions

  • Easy to migrate to new domains

  • Can customize itself to individual users

Page 61:

➢ Discover new knowledge from large databases (data mining)
  • Market basket analysis (e.g. diapers and beer)

  • Medical text mining (e.g. migraines to calcium channel blockers to magnesium)

➢ Engineer better computing systems

➢ Build a model that is a good and useful approximation to the data

➢ Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce

Page 62:

Issues with Machine Learning

❖What algorithms are available for learning a concept? How well do they perform?

❖How much training data is sufficient to learn a concept with high confidence?

❖When is it useful to use prior knowledge?

❖Are some training examples more useful than others?

❖ What are the best tasks for a system to learn?

Page 63:

❖What is the best way for a system to represent its knowledge?

❖How can we optimize the accuracy on future data points?

❖ Are some learning problems computationally intractable?

❖How can we formulate application problems as machine learning paradigms?

Page 64:

Paradigms of Machine Learning Algorithms

❖ Learning algorithms fall into various paradigms with respect to the sort of feedback that the learner has access to:
  ✓ Supervised Learning
  ✓ Unsupervised Learning
  ✓ Semi-Supervised Learning
  ✓ Reinforcement Learning

Page 65:

Page 66:

Page 67:

Supervised

❖ For every input, the learner is provided with a target; that is, the environment tells the learner what its response should be.

❖ The learner then compares its actual response to the target and adjusts its internal memory in such a way that it is more likely to produce the appropriate response the next time it receives the same input.

❖ We can think of learning a simple categorization task as supervised learning.

Page 68:

• Supervised learning: (a) shows a three-class labeled dataset, where the color indicates the label of each sample. After supervised learning, the class-separating boundaries can be found, shown as the dotted lines in (b).


Page 69:

Classification Example

Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

A classifier (model) is learned from the training set and then applied to the test set.
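As a hedged illustration of this train-then-predict workflow, the sketch below fits scikit-learn's DecisionTreeClassifier to the same toy table; the one-hot encoding and the choice of a decision tree are assumptions, not something prescribed by the slide.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    train = pd.DataFrame({
        "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                    "Married", "Divorced", "Single", "Married", "Single"],
        "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
        "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    })
    test = pd.DataFrame({
        "Refund":  ["No", "Yes", "No", "Yes", "No", "No"],
        "Marital": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
        "Income":  [75, 50, 150, 90, 40, 80],
    })

    # One-hot encode the categorical attributes so the tree receives numeric inputs.
    X_train = pd.get_dummies(train[["Refund", "Marital", "Income"]])
    X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

    model = DecisionTreeClassifier().fit(X_train, train["Cheat"])
    print(model.predict(X_test))   # predicted "Cheat" labels for the test set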

Page 70:

• Support Vector Machines

• logistic regression,

• linear discriminant analysis,

• decision trees

• k-nearest neighbor algorithm

• Neural Networks (Multilayer perceptron)

• naive Bayes etc.

Page 71:

Unsupervised

❖ The learner receives no feedback from the world at all.

❖ Instead the learner's task is to re-represent the inputs in a more efficient way, as clusters or categories or using a reduced set of dimensions.

❖ Unsupervised learning is based on the similarities and differences among the input patterns. It does not result directly in differences in overt behavior because its "outputs" are really internal representations.

Page 72:

Page 73:

Clustering

•The goal of clustering is to

• group data points that are close (or similar) to each other

• identify such groupings (or clusters) in an unsupervised manner i.e. no information is provided to the algorithm on which data points belong to which clusters
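A minimal sketch of this idea using k-means (one of the clustering algorithms listed a few slides later); the synthetic 2-D blobs and the choice of three clusters are assumptions for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three synthetic blobs; no labels are ever given to the algorithm.
    points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                        for c in ([0, 0], [5, 5], [0, 5])])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_[:10])        # cluster index assigned to each point
    print(kmeans.cluster_centers_)    # the three discovered group centres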

Page 74:

Examples of Clustering

• Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts
  • Tailor-made for each person: too expensive

  • One-size-fits-all: does not fit all

• Example 2: In marketing, segment customers according to their similarities
  • To do targeted marketing

• Insurance: identifying groups of motor insurance policy holders with some interesting characteristics

• Spatial data analysis
  • E.g., land use, city planning, earthquake studies

Page 75:

• Neural network models (self-organizing map (SOM) and adaptive resonance theory (ART))

• Clustering (e.g., k-means, Gaussian mixture models, k-mode)

Page 76:

❖ Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).

❖ It makes use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.
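A small self-training sketch of this idea (a generic approach, not one of the specific algorithms listed on the following slides): a base classifier is fit on the few labeled points, and its most confident predictions on the unlabeled pool are added back as pseudo-labels. All data here is synthetic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    X_lab = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
    y_lab = np.array([0, 0, 1, 1])                         # small labeled set
    X_unl = rng.normal(loc=2.5, scale=2.0, size=(200, 2))  # large unlabeled set

    clf = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(5):                                     # a few self-training rounds
        proba = clf.predict_proba(X_unl)
        confident = proba.max(axis=1) > 0.95               # keep only confident pseudo-labels
        X_aug = np.vstack([X_lab, X_unl[confident]])
        y_aug = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        clf = LogisticRegression().fit(X_aug, y_aug)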

Page 77:

• Semi-supervised learning: (a) shows a labeled dataset (red, green, and blue) together with an unlabeled dataset (marked in black). The distribution of the unlabeled data can guide the position of the separating boundary, as shown in (b).


Page 78:

• Semi-Supervised Support Vector Machines (S3VMs)

• Laplacian Regularized Least Squares (LapRLS)

• Semi-Supervised Random Forests

Page 79:

Reinforcement learning

❖ The learner receives feedback about the appropriateness of its response.

❖ For correct responses it resembles supervised learning. However, the two forms of learning differ significantly for errors, situations in which the learner's behavior is in some way inappropriate.

❖ In these situations, supervised learning lets the learner know exactly what it should have done, whereas reinforcement learning only says that the behavior was inappropriate and (usually) how inappropriate it was.

Page 80:

Page 81:

• Q-Learning

• Temporal Difference Learning

• Prioritized Sweeping

• Dynamic Bayesian Network-Markov Decision Process
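A minimal sketch of the tabular Q-learning update (the first algorithm in the list above); the tiny 5-state chain environment, learning rate, discount factor, and exploration rate are illustrative assumptions.

    import random

    N_STATES, ACTIONS = 5, [0, 1]            # action 0 = move left, 1 = move right
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    alpha, gamma, eps = 0.5, 0.9, 0.2        # learning rate, discount factor, exploration rate

    for _ in range(500):                     # episodes, each from a random start state
        s = random.randrange(N_STATES - 1)
        for _ in range(50):                  # cap the episode length
            greedy = max(ACTIONS, key=lambda a: Q[s][a])
            a = random.choice(ACTIONS) if random.random() < eps else greedy
            s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == N_STATES - 1 else 0.0
            # Q-learning update: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
            if s == N_STATES - 1:            # reward collected, episode ends
                break

    print(Q)   # moving right should get the higher value in every non-terminal state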

Page 82:

Classification of ML Algorithms

❖ Based on how data is made available to the learning algorithm, or on the way learning happens:
  ✓ Batch Learning
  ✓ Online Learning
  ✓ Instance-Based Learning
  ✓ Incremental Learning
  ✓ Deep Learning
  ✓ Evolutionary Learning
  ✓ Sequence Learning

Page 83:

Batch Learning or Offline Learning

• Machine learning algorithms assume they have access to the entire training dataset at once.

• In general, most machine learning algorithms fall into this category.

• SVM, Neural Networks (MLP) etc.

Page 84:

Online Learning

• Data arrives in a sequential fashion at a very high rate (i.e. data streams); it is not possible to store all of the data, which forces real-time analysis

• Slightly different characteristics than time series data

• Examples of data streams include computer network traffic, web searches, and sensor data.

Page 85:

• VERY FAST DECISION TREE (VFDT)

• Concept-Adapting Very Fast Decision Tree (CVFDT)

• BIRCH

• STREAM

• CluStream

Page 86:

Sequence Learning

• Most machine learning algorithms are designed for independent, identically distributed (i.i.d.) data

• Sequence learning is the study of machine learning algorithms designed for sequential data. These algorithms should
  • not assume data points to be independent, i.e. the data instances are strongly correlated

  • be able to deal with sequential distortions

  • make use of context information

Page 87:

• Applications include speech recognition, gesture recognition, protein secondary structure prediction, handwriting recognition.

• Algorithms: Hidden Markov Models (HMM), Conditional Random Fields (CRF), Maximum-Entropy Markov Models (MEMM)

Page 88:

Instance-based or Memory-based learning

• Instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.

• K-nearest neighbor, Neural Network (RBF), Locally Weighted Regression

Page 89:

Incremental Learning

• Capable of learning and updating the model with every new piece of data, whether labeled or unlabeled.
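A small sketch of incremental updating using scikit-learn's SGDClassifier and its partial_fit method (an illustrative choice, not one of the algorithms listed on the next slide): mini-batches arrive one at a time and the model is updated after each.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    model = SGDClassifier()
    classes = np.array([0, 1])               # all class labels must be declared for partial_fit

    for _ in range(20):                      # 20 incoming mini-batches
        X_batch = rng.normal(size=(32, 5))
        y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)   # synthetic labels
        model.partial_fit(X_batch, y_batch, classes=classes)        # update on the new data only

    print(model.predict(rng.normal(size=(3, 5))))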

Page 90:

• Incremental SVM

• Incremental HMM

• Incremental Sigmoid Belief Networks (ISBNs)

• Incremental Decision Trees (ID5R)

Page 91:

Deep Learning

• Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.

• Deep Boltzmann Machine, Convolutional Deep Neural Networks, Deep Belief Networks

Page 92:

Displaying the structure of a set of documents using Latent Semantic Analysis (a form of PCA)

Each document is converted to a vector of word counts. This vector is then mapped to two coordinates and displayed as a colored dot. The colors represent the hand-labeled classes.

When the documents are laid out in 2-D, the classes are not used. So we can judge how good the algorithm is by seeing if the classes are separated.
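A minimal sketch of that pipeline with scikit-learn: word counts via CountVectorizer, then TruncatedSVD (a standard way to compute LSA) to map each document to two coordinates. The toy documents are made up; the hand-labeled classes and the colored plot are not reproduced here.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the stock market fell sharply today",
        "investors sold shares as the market dropped",
        "the team won the football match",
        "a late goal decided the football game",
    ]

    counts = CountVectorizer().fit_transform(docs)          # document -> vector of word counts
    coords = TruncatedSVD(n_components=2).fit_transform(counts)
    print(coords)   # one 2-D point per document; similar topics should land near each other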

Page 93:

Displaying the structure of a set of documents using a deep neural network

Page 94:

Regularization

• An extension made to a basic learning method that penalizes models based on their complexity, favoring simpler models that are also better at generalizing (see the sketch after the list below).

• The most popular regularization algorithms are:

• Ridge Regression

• Least Absolute Shrinkage and Selection Operator (LASSO)

• Elastic Net

• Least-Angle Regression (LARS)
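A brief sketch of the first item in the list, ridge regression, which adds an L2 penalty on the coefficients; the synthetic data and the alpha value are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10))                        # few samples, many features
    y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=30)   # only the first feature truly matters

    plain = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)                   # alpha controls the penalty strength

    # The penalized model keeps the irrelevant coefficients closer to zero.
    print(np.abs(plain.coef_[1:]).sum(), np.abs(ridge.coef_[1:]).sum())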

Page 95:

Ensemble Algorithms

• Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.

• Boosting
• Bootstrapped Aggregation (Bagging)
• AdaBoost
• Stacked Generalization (blending)
• Gradient Boosting Machines (GBM)
• Gradient Boosted Regression Trees (GBRT)
• Random Forest
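A short sketch of bagging (and its random-forest variant) from the list above: many trees are trained on bootstrap resamples of the data and their predictions are combined by voting. scikit-learn's BaggingClassifier and RandomForestClassifier are used here as an illustrative choice, with a synthetic dataset.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                random_state=0).fit(X_tr, y_tr)
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

    print(bagging.score(X_te, y_te), forest.score(X_te, y_te))   # test accuracy of each ensemble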

Page 96:

Dimensionality Reduction Algorithms

• This can be useful to reduce the number of features, either to visualize high-dimensional data or to simplify the learning process (a PCA sketch follows the list below)

• Principal Component Analysis (PCA)

• Principal Component Regression (PCR)

• Partial Least Squares Regression (PLSR)

• Sammon Mapping

• Multidimensional Scaling (MDS)

• Projection Pursuit

• Linear Discriminant Analysis (LDA)

• Mixture Discriminant Analysis (MDA)

• Quadratic Discriminant Analysis (QDA)

• Flexible Discriminant Analysis (FDA)
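A compact sketch of the first item in the list, PCA, projecting 4-dimensional measurements down to two principal components; the Iris dataset is an illustrative choice.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)            # 150 samples, 4 features
    pca = PCA(n_components=2).fit(X)
    X_2d = pca.transform(X)                      # coordinates along the top two components

    print(pca.explained_variance_ratio_)         # fraction of variance captured per component
    print(X_2d[:3])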

Page 97:

Linear & Non-Linear Separability

Page 98:

Linear Separability

• Let X0 and X1 be two sets of points in an n-dimensional Euclidean space. Then X0 and X1 are linearly separable if there exist n + 1 real numbers w1, w2, ..., wn, k, such that every point x in X0 satisfies

    Σ (i = 1 to n) wi·xi > k

  and every point x in X1 satisfies

    Σ (i = 1 to n) wi·xi < k

  where xi is the ith component of x.
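The definition can be read operationally: if the two sets are linearly separable, the perceptron algorithm will find such weights. A minimal sketch on a toy separable dataset follows; the data points and the number of passes are illustrative assumptions.

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5],      # class X0
                  [6.0, 5.0], [7.0, 7.5], [5.5, 6.5]])     # class X1
    y = np.array([-1, -1, -1, 1, 1, 1])                    # -1 for X0, +1 for X1

    w = np.zeros(2)
    b = 0.0                                                # b plays the role of -k
    for _ in range(100):                                   # passes over the data
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:                     # misclassified point: update
                w += yi * xi
                b += yi

    print(w, b)   # w·x + b > 0 for points in X1, and w·x + b < 0 for points in X0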

Page 99:

• Linearly separable data: if all the data points can be correctly classified by a linear decision boundary (line).

Page 100:

These two sets are linearly separable because there exists a line in the plane with all of the blue points on one side and all of the red points on the other. This idea immediately generalizes to higher-dimensional Euclidean spaces if the line is replaced by a hyperplane.

Page 101:

• If not linearly separable:
  ✓ Allow some errors

  ✓ Still, try to place the hyperplane "far" from each class

Page 102:

Non-Linear Separability

Page 103:

Non Linear problem

Page 104:

Page 105:

Linear Separability

• Linear or non-linear separable data?
  • We can find out only empirically

• Linear algorithms (algorithms that find a linear decision boundary)
  • Used when we think the data is linearly separable

  • Advantages: simpler, fewer parameters

  • Disadvantages: high-dimensional data is usually not linearly separable

  • Examples: Perceptron, SVM

Page 106:

Nonlinear Separability

• It is well known that a nonlinear mapping from a low-dimensional space into a high-dimensional space facilitates linear classification.

Page 107:

• In (a) a two-dimensional input space is depicted, in which the yellow spheres and the red stars cannot be separated with a single straight line.

• With a nonlinear mapping into a three-dimensional space, as depicted in (b), the spheres and stars can be separated by a single linear hyperplane.

Page 108:

Page 109:


Radial Basis Function (RBF) kernel in LIBSVM
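A brief sketch of the same effect via scikit-learn's SVC, which wraps LIBSVM and uses the RBF kernel by default; the two-circles dataset stands in for data that is not linearly separable in the input space.

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X, y)
    rbf_svm = SVC(kernel="rbf").fit(X, y)        # the default kernel in SVC / LIBSVM

    print(linear_svm.score(X, y))   # roughly chance level on this data
    print(rbf_svm.score(X, y))      # close to 1.0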

Page 110:

Multiclass or multinomial classification

•Given: some data items that belong to one of M mutually-exclusive classes

•Task: Train the classifier and predict the class for a new data item

•Geometrically: harder problem, no more simple geometry

Page 111:

Multi-class classification

Page 112:

• For example, classifying a set of images of fruits which may be oranges, apples, or pears.

• Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.

• The first category of algorithms, which handle multiple classes natively, includes decision trees, neural networks, k-Nearest Neighbor, and Naive Bayes classifiers.

Page 113:

• The basic SVM supports only binary classification, but extensions have been proposed to handle the multiclass classification case as well.

Page 114:

• The multiclass classification problem can be decomposed into several binary classification tasks that can be solved efficiently using binary classifiers.

• The most successful and widely used binary classifiers are Support Vector Machines. The idea is similar to that of using codewords for each class and then using a number of binary classifiers to solve several binary classification problems, whose results can determine the class label for new data.

Page 115:

One-versus-all (OVA)

• The simplest approach is to reduce the problem of classifying among K classes into K binary problems, where each problem discriminates a given class from the other K-1 classes

• When testing an unknown example, the classifier producing the maximum output is considered the winner, and this class label is assigned to that example

Page 116:

All-versus-all (AVA)

• In this approach, each class is compared to every other class. A binary classifier is built to discriminate between each pair of classes, while discarding the rest of the classes. This requires building K(K−1)/2 binary classifiers.

• When testing a new example, a voting is performed among the classifiers and the class with the maximum number of votes wins
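A short sketch of both decompositions using scikit-learn's wrappers around a binary SVM: OneVsRestClassifier builds K classifiers (the OVA scheme above) and OneVsOneClassifier builds K(K−1)/2 (the AVA scheme). The Iris data is an illustrative 3-class example.

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)            # 3 classes

    ova = OneVsRestClassifier(SVC()).fit(X, y)   # K = 3 binary classifiers
    ava = OneVsOneClassifier(SVC()).fit(X, y)    # K(K-1)/2 = 3 binary classifiers

    print(len(ova.estimators_), len(ava.estimators_))   # 3 and 3
    print(ova.predict(X[:5]), ava.predict(X[:5]))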

Page 117:

Error-Correcting Output-Codes

• Output-code based strategies are fairly different from one-vs-the-rest and one-vs-one. With these strategies, each class is represented in a Euclidean space where each dimension can only be 0 or 1; in other words, each class is represented by a binary code (an array of 0s and 1s).

• The matrix which keeps track of the location/code of each class is called the code book, and the code size is the dimensionality of this space.

• Intuitively, each class should be represented by a code that is as unique as possible, and a good code book should be designed to optimize classification accuracy.

Page 118:

Dr. G. Bharadwaja Kumar, VIT Chennai 118

Page 119: Machine Learning - WordPress.comMachine learning expedite computers to learn and improve automatically using example data or past experience and enable computers to handle new situations

Multi-Label Classification

•Given: some data items that belong to more than one class of M possible classes

• Task: Train the classifier and predict the class for a new data item

•Geometrically: harder problem, no more simple geometry


Multi-label classification: Examples

• Language identification

• Text categorization (topics)

• For instance, an article in a newspaper may be assigned to the categories POLITICS, SPORTS, RELIGION, etc.


• Some classification algorithms/models have been adapted to the multi-label task, without requiring problem transformations. Examples of these include:

• boosting: AdaBoost.MH and AdaBoost.MR are extended versions of AdaBoost for multi-label data.

• k-nearest neighbors: the ML-kNN algorithm extends the k-NN classifier to multi-label data.

• decision trees: "Clare" is an adapted C4.5 algorithm for multi-label classification; the modification involves the entropy calculations.

• neural networks: BP-MLL is an adaptation of the popular back-propagation algorithm for multi-label learning.
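
The list above covers algorithm adaptations; a simpler alternative is the problem-transformation route. The sketch below shows binary relevance (one independent binary classifier per label), with toy data and label names that are purely illustrative:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Toy feature vectors for four documents, each with one or more topic labels.
X = np.array([[1.0, 0.2], [0.9, 0.8], [0.1, 0.9], [0.2, 0.1]])
labels = [["POLITICS"], ["POLITICS", "RELIGION"], ["SPORTS"], ["SPORTS"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)            # binary indicator matrix, one column per label

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)                            # fits one binary classifier per label column
print(mlb.inverse_transform(clf.predict(X)))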


• Various binary classifiers have been developed over time and there is no clear winner as to which classifier performs the best. Different classifiers perform differently depending on the number of observations, the dimensionality of the feature vector, the noise in the data and various other factors. For example, random forests perform better than SVM classifiers for 3D point clouds.


General Assumptions on Dataset

•In machine learning, an unknown universal dataset is assumed to exist, which contains all the possible data pairs as well as their probability distribution of appearance in the real world.


• In real applications, however, what we observe is only a subset of the universal dataset, due to lack of memory or other unavoidable reasons.

• This acquired dataset is called the training set (training data) and used to learn the properties and knowledge of the universal dataset.

• In general, vectors in the training set are assumed to be independently and identically distributed (i.i.d.) samples from the universal dataset


Bias – Variance Tradeoff


• The bias-variance tradeoff is an important aspect of data science projects based on machine learning.

• Any learning algorithm that uses a mathematical or statistical model has an “error” that can be split into two main components: reducible and irreducible error.

• Irreducible error or inherent uncertainty is associated with a natural variability in a system. On the other hand, reducible error, as the name suggests, can be and should be minimized further to maximize accuracy.


• Suppose our outcome variable is Y and our covariates are X. We may assume there is a relationship relating one to the other, such as Y = f(X) + ϵ, where the error term ϵ is normally distributed with mean zero: ϵ ∼ N(0, σ_ϵ)


We may estimate a model f̂(x) of f(X) using linear regression or another modeling technique. In this case, the expected squared prediction error at a point x is:

Err(x) = E[(Y − f̂(x))²] = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ_ϵ² = Bias² + Variance + Irreducible error
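
A small Monte Carlo simulation makes this decomposition concrete: repeatedly draw training sets from Y = f(X) + ϵ, fit a model to each, and split the squared error at a point into bias², variance, and the irreducible σ_ϵ². The choice of f, the noise level, the polynomial degree, and the sample size below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # assumed "true" function
sigma_eps = 0.3                 # irreducible noise level
x0 = 1.0                        # point at which the error is evaluated
degree = 3                      # model flexibility (compare degree 1 vs. 9)

preds = []
for _ in range(2000):                              # many independent training sets
    X = rng.uniform(0, 2 * np.pi, 25)
    Y = f(X) + rng.normal(0, sigma_eps, X.shape)
    w = np.polyfit(X, Y, degree)                   # fit f_hat on this training set
    preds.append(np.polyval(w, x0))                # its prediction at x0

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2                # (E[f_hat(x0)] - f(x0))^2
variance = preds.var()                             # E[(f_hat(x0) - E[f_hat(x0)])^2]
print(f"bias^2={bias2:.4f}  variance={variance:.4f}  irreducible={sigma_eps**2:.4f}")
print(f"total ~= {bias2 + variance + sigma_eps**2:.4f}  (expected squared prediction error at x0)")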


• That third term, irreducible error, is the noise term in the true relationship that cannot fundamentally be reduced by any model. Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and variance terms to 0.

• However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.


• Reducible error can be further decomposed into “error due to bias” and “error due to variance.” The data scientist’s goal is to simultaneously reduce bias and variance as much as possible in order to obtain as accurate a model as is feasible.

• However, there is a tradeoff to be made when selecting models of different flexibility or complexity and in selecting appropriate training sets to minimize these sources of error!


• The bias is error from erroneous assumptions in the learning algorithm or from model mismatch. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

• The variance is error from sensitivity to the particular training sample and to randomization in the learning procedure. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.


• The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data, but also generalizes well to unseen data.

• Unfortunately, it is typically impossible to do both simultaneously.

• High-variance learning methods may be able to represent their training set well, but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit, but may underfit their training data, failing to capture important regularities.


Error due to Bias

• The error due to bias is taken as the difference between the expected (or average obtained from cross-validation) prediction of our model and the correct value which we are trying to predict.

• Bias measures how far off in general these models' predictions are from the correct value. If these average prediction values are substantially different from the true value, bias will be high.


Error due to Variance

• The error due to variance is the amount by which the prediction, over one training set, differs from the expected predicted value, over all the training sets.


Graphical Visualization of bias and variance using a bulls-eye diagram


• If your target truth is nonlinear and you select a linear model to approximate it, you introduce a bias: the linear model is underfitting the nonlinear target function over the training set. Likewise, if your target truth is linear and you select a nonlinear model to approximate it, then you’re introducing a bias resulting from the nonlinear model’s inability to be linear where it needs to be. In fact, the nonlinear model is overfitting the linear target function over the training set.


bias vs. variance

(Figure: three fits ranging from high bias / low variance, through medium bias / medium variance, to low bias / high variance.)


Bias-variance tradeoff


(Figure: training error and test error versus model complexity. The low-complexity side corresponds to underfitting (high bias, low variance); the high-complexity side corresponds to overfitting (low bias, high variance).)


Bias-variance tradeoff


(Figure: test error versus model complexity, shown for many training examples and for few training examples; complexity again ranges from high bias / low variance to low bias / high variance.)


Effect of Training Size


Effect of Model Complexity


Computational Learning Theory


Learning Theory

• COLT helps to define the class of learnable concepts in terms of computational complexity, i.e., the time and space complexity of the learning algorithm (which depends on the cost of the computational representation of the concepts), and sample complexity, i.e., the number of training examples necessary or sufficient to learn hypotheses of a given accuracy.


• A good hypothesis is a productive one. A productive hypothesis can:
  • Be easily learned and applied
  • Explain the past accurately and persuasively
  • Make accurate predictions about the future
  • Generate new even more useful hypotheses
  • Be applied to a wide variety of situations
  • Be easily tested


• Learning in the limit: Is the learner guaranteed to converge to the correct hypothesis in the limit as the number of training examples increases indefinitely?

• Sample Complexity: Can one characterize the number of training examples necessary/sufficient for highly accurate learning?

• Computational Complexity: How many computational resources (time and space) are needed for a learner to learn a highly accurate hypothesis?

• Is it possible to identify classes of concepts that are inherently difficult/easy to learn, independent of the learning algorithm?


• Mistake Bound: how many training examples will the learner misclassify before constructing a highly accurate concept


Two frameworks for analyzing learning algorithms

• Probably Approximately Correct (PAC) framework
  • Identify classes of hypotheses that can/cannot be learned from a polynomial number of training samples
  • Finite hypothesis spaces
  • Infinite hypothesis spaces (VC dimension)
  • Define a natural measure of complexity for hypothesis spaces (VC dimension) that allows bounding the number of training examples required for inductive learning

• Mistake bound framework
  • Number of training errors made by a learner before it determines the correct hypothesis


PAC Learning


• PAC Model
  • Only requires learning a Probably Approximately Correct concept: learn a decent approximation most of the time.
  • Requires polynomial sample complexity and computational complexity.


• PAC-learnability is determined mostly by the number of training examples required by the learner. The sample complexity of a learning problem is the growth in the number of required training examples with problem size.

• This is because in practical settings the most limiting factor of the learner is the number of training examples available.


• Suppose we try to characterize the number of training examples needed to learn a hypothesis h for which error = 0. Unfortunately, it turns out this is futile in the setting we are considering, for two reasons.

• First, unless we provide training examples corresponding to every possible instance in X (an unrealistic assumption), there may be multiple hypotheses consistent with the provided training examples, and the learner cannot be certain to pick the one corresponding to the target concept.

• Second, given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by the learner will be misleading.


• To accommodate these two difficulties, we weaken our demands on the learner in two ways.

• First, we will not require that the learner output a zero-error hypothesis; we will require only that its error be bounded by some constant, ε, that can be made arbitrarily small. Second, we will not require that the learner succeed for every sequence of randomly drawn training examples; we will require only that its probability of failure be bounded by some constant, δ, that can be made arbitrarily small.


• The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept.

• In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) the system learns a concept with error at most ε.


Chernoff‐Hoeffding bound

• If you average a bunch of bounded random variables, then the probability this average random variable deviates from its expectation is exponentially small in the amount of deviation.

• Let X_1, X_2, ..., X_m be independent random variables whose values are in the range [0, 1]. Let

X = Σ_i X_i,   μ_i = E[X_i],   μ = E[X] = Σ_i μ_i

Then for all ε > 0,

Pr(|X − μ| > ε) ≤ 2e^(−2ε²/m)
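
A quick numerical check of the bound; the uniform distribution and the values of m and ε below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 100, 10.0, 100_000

# X_i independent and bounded in [0, 1]; here X = sum_i X_i, so mu = E[X] = m/2.
X = rng.uniform(0, 1, size=(trials, m)).sum(axis=1)
mu = m / 2
empirical = np.mean(np.abs(X - mu) > eps)
hoeffding = 2 * np.exp(-2 * eps**2 / m)

print(f"empirical Pr(|X - mu| > {eps}) = {empirical:.5f}")
print(f"Hoeffding bound              = {hoeffding:.5f}")   # always >= the empirical value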


• One nice thing about the Chernoff bound is that it doesn’t matter how the variables are distributed.

• This is important because in PAC we need guarantees that hold for any distribution generating data.

• Indeed, in this case the random variables above will be individual examples drawn from the distribution generating the data.


• We’ll be estimating the probability that our hypothesis has error deviating by more than ε, and we want to bound this probability by δ, as in the definition of PAC‐learning.

• Since the amount of deviation (error ε) and the number of samples (m ) both occur in the exponent, the trick is in balancing the two values to get what we want.


• An algorithm that efficiently finds a consistent hypothesis will PAC‐learn any finite concept class provided it has at least m samples, where

m ≥ (1/ε) (log|H| + log(1/δ))
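
Plugging numbers into the bound is straightforward; the hypothesis space (boolean conjunctions over 10 literals) and the targets ε = 0.1, δ = 0.05 below are illustrative assumptions, with log read as the natural logarithm:

import math

def pac_sample_bound(H_size, eps, delta):
    # m >= (1/eps) * (log|H| + log(1/delta))
    return math.ceil((1.0 / eps) * (math.log(H_size) + math.log(1.0 / delta)))

# Conjunctions over n = 10 boolean variables: each variable appears positive,
# negated, or not at all, so |H| = 3^10.
print(pac_sample_bound(H_size=3**10, eps=0.1, delta=0.05))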


Inductive Learning Hypothesis

•Any hypothesis found to approximate the target function well over the training examples, will also approximate the target function well over the unobserved examples.

• find the hypothesis that best fits the training data


• The version space contains all plausible versions of the target concept


• A hypothesis h is consistent with training examples D iff h(x) = c(x) for each example <x, c(x)> in D

• The version space with respect to hypothesis space H and training examples D is the subset of hypotheses from H consistent with the training examples in D


• The VC dimension (Vapnik–Chervonenkis dimension) is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.


The probabilistic guarantee

E_test ≤ E_train + √[ ( h (log(2N/h) + 1) − log(p/4) ) / N ]

where N = size of training set

h = VC dimension of the model class = complexity

p = upper bound on probability that this bound fails

So if we train models with different complexity, we should pick the one that minimizes this bound.

Actually, this is only sensible if we think the bound is fairly tight, which it usually isn’t. The theory provides insight, but in practice we still need some witchcraft.
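
The bound is easy to evaluate numerically; the error rates, N, h, and p below are illustrative assumptions, and (as noted above) the resulting numbers are usually far too loose to take literally:

import math

def vc_bound(E_train, N, h, p=0.05):
    # E_test <= E_train + sqrt( (h * (log(2N/h) + 1) - log(p/4)) / N )
    return E_train + math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(p / 4)) / N)

# Two model classes trained on the same data: a simple one that fits slightly
# worse, and a complex one that fits better but has a much looser bound.
print(vc_bound(E_train=0.05, N=10_000, h=100))
print(vc_bound(E_train=0.01, N=10_000, h=5_000))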


A simple example: Fitting a polynomial

• The green curve is the true function (which is not a polynomial)

• The data points are uniform in x but have noise in y.

• We will use a loss function that measures the squared error in the prediction of y(x) from x. The loss for the red polynomial is the sum of the squared vertical errors.

from Bishop


Some fits to the data: which is best?

from Bishop


A simple way to reduce model complexity

• If we penalize polynomials that have big values for their coefficients, we will get less wiggly solutions:

Ẽ(w) = (1/2) Σ_{n=1..N} { y(x_n, w) − t_n }² + (λ/2) ||w||²

where Ẽ(w) is the penalized loss function, t_n is the target value, and λ is the regularization parameter.

from Bishop
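
A minimal sketch of fitting a polynomial by minimizing this penalized loss directly; the degree, the value of λ, and the synthetic sine data are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
N, degree, lam = 10, 9, 1e-3                       # few points, flexible model, small penalty
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)  # noisy targets

Phi = np.vander(x, degree + 1, increasing=True)    # design matrix: 1, x, x^2, ..., x^degree
# Minimizing (1/2) * sum_n (y(x_n, w) - t_n)^2 + (lam/2) * ||w||^2 has the
# closed-form solution w = (Phi^T Phi + lam * I)^(-1) Phi^T t (ridge regression).
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)

print(np.round(w, 2))    # with lam = 0 the coefficients blow up; the penalty keeps them small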


Regularization: comparison of fits for two different values of λ (figure omitted)


Polynomial Coefficients


Generalization

• The real aim of Machine Learning is to do well on test data that is not known during learning.


Generalization

• Choosing the values for the parameters that minimize the loss function on the training data is not necessarily the best policy.

• We want the learning machine to model the true regularities in the data and to ignore the noise in the data.
  • But the learning machine does not know which regularities are real and which are accidental quirks of the particular set of training examples we happen to pick.

• So how can we be sure that the machine will generalize correctly to new data?


Generalization

• One can only say a model generalizes well if it explains the data surprisingly well given the complexity of the model.

• If the model has as many degrees of freedom as the data, it can fit the data perfectly but so what?

• There is a lot of theory about how to measure the model complexity and how to control it to optimize generalization.


Generative vs. Discriminative


Generative vs. Discriminative

• Generative models: specify a joint probability distribution over observation and label sequences, i.e., a full probabilistic model of all variables
  • Model class-conditional pdfs and prior probabilities

• Discriminative models: provide a model only for the target variable(s) conditional on the observed variables
  • Directly estimate posterior probabilities
  • No attempt to model the underlying probability distributions



Generative Methods

• ☺ Relatively straightforward to characterize invariances

• ☺ They can handle partially labelled data

• ☹ They wastefully model variability which is unimportant for classification

• ☹ They scale badly with the number of classes and the number of invariant transformations

• ☹ Slow on test data

• ☹ Higher asymptotic error


Discriminative Methods

• ☺ They can be very fast once trained

• ☺ lower asymptotic error

• ☹ Inherently supervised; cannot deal with unlabelled data

• ☹ They interpolate between training examples, and hence can fail if novel inputs are presented

• ☹ They don’t easily handle compositionality


Generative vs. Discriminative

Generative
• Naïve Bayes
• Mixtures of Gaussians
• Hidden Markov Models (HMM)
• Bayesian networks
• Markov random fields

Discriminative
• Logistic regression
• SVMs
• Neural networks (MLP, RBF)
• Nearest neighbor
• Conditional Random Fields (CRF)
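
A side-by-side sketch of one generative and one discriminative classifier from the lists above; the synthetic dataset and settings are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models P(x | y) and P(y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(y | x) directly

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    model.fit(Xtr, ytr)
    print(type(model).__name__, model.score(Xte, yte))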


Parametric vs non-parametric models

• Parametric models assume some finite set of parameters θ.

• Given the parameters, future predictions, x, are independent of the observed data, D:

P(x|θ, D) = P(x|θ)

• Therefore θ captures everything there is to know about the data.

• So the complexity of the model is bounded even if the amount of data is unbounded. This makes parametric models not very flexible.


• A non-parametric model is not a model with no parameters: its parameters are determined by the training data rather than being fixed in advance by the model.

• Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite dimensional θ. Usually we think of θ as a function.

• The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes them more flexible.
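
A sketch of the contrast: the parametric linear model is summarized by a fixed coefficient vector θ, while the non-parametric k-NN regressor keeps the training data and effectively grows with it. The synthetic data and settings are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression     # parametric: fixed-size theta
from sklearn.neighbors import KNeighborsRegressor     # non-parametric: stores the data

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

lin = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

print(lin.coef_, lin.intercept_)                       # all the linear model retains about D
print(lin.predict([[3.0]]), knn.predict([[3.0]]))      # k-NN consults the stored neighbours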


• Monte Carlo Hidden Markov Models: non-parametric hidden Markov models with continuous state and observation spaces.

• Based on the Dirichlet Process, a nonparametric Bayesian Hidden Markov Model is proposed, which allows an infinite number of hidden states and uses an infinite number of Gaussian components to support continuous observations


Parameters to learn

• Two kinds of parameters

• One kind the user sets for the training procedure in advance – hyperparameters
  • the degree of polynomial to match in regression
  • the number/size of hidden layers in a neural network
  • the number of instances per leaf in a decision tree

• One kind that actually gets optimized through the training – parameters
  • regression coefficients
  • network weights
  • size/depth of the decision tree

• We usually do not talk about the latter, but refer to hyperparameters as parameters
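
In practice the two kinds are handled by different mechanisms: hyperparameters are chosen by an outer search (for example, cross-validated grid search), while the ordinary parameters are optimized by training itself. The estimator and grid below are illustrative assumptions:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Hyperparameter: set before training, tuned here by cross-validated search.
grid = {"min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

print(search.best_params_)                     # chosen hyperparameter value
# Parameters: learned by the training procedure itself (here, the tree splits).
print(search.best_estimator_.get_depth())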


•Machine Learning algorithms rely on three components:

✓ Representation

✓ Optimization

✓ Evaluation


Model Selection


• The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The chosen model is the one that minimizes the Kullback-Leibler distance between the model and the truth. It’s based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined as:

AIC = -2 ( ln ( likelihood )) + 2 K

where likelihood is the probability of the data given a model and K is the number of free parameters in the model.


• The second order information criterion, often called AICc, takes into account sample size by, essentially, increasing the relative penalty for model complexity with small data sets. It is defined as:

AICc = -2 ( ln ( likelihood )) + 2 K * (n / ( n – K – 1))

where n is the sample size. As n gets larger, AICc converges to AIC (n − K − 1 approaches n when n gets much bigger than K, so n / (n − K − 1) approaches 1), and so there is really no harm in always using AICc regardless of sample size.
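
A small helper computing both criteria from a fitted model's log-likelihood; the Gaussian linear-regression example used to produce the log-likelihood is an illustrative assumption:

import numpy as np

def aic(logL, K):
    return -2.0 * logL + 2.0 * K

def aicc(logL, K, n):
    return -2.0 * logL + 2.0 * K * (n / (n - K - 1))

# Illustrative: maximum-likelihood Gaussian linear model y = X b + e.
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / n                         # ML estimate of the noise variance
logL = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1) # Gaussian log-likelihood at the ML fit
K = X.shape[1] + 1                                 # free parameters: coefficients + variance

print(f"AIC  = {aic(logL, K):.2f}")
print(f"AICc = {aicc(logL, K, n):.2f}")            # stronger penalty because n is small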


Minimum Description Length

• MDL is an information-theoretic approach to machine learning, or statistical model selection

• Basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself.

• Provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution.


• Hence you really want to minimize the combined length of the description of the model, plus the description of the data under that model.

• The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally


Cannot learn exact concepts from limited data, only approximations.

(Figure: a learner is trained on positive and negative examples and produces a classifier that answers yes/no for new instances.)


• A measure of the “power” or the “complexity” of the hypothesis space

• Higher VC dimension implies a more “expressive” hypothesis space

• Shattering: a set of N points is shattered if, for every possible classification of the N points, there exists a hypothesis consistent with that classification
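
A brute-force illustration for linear classifiers in the plane; treating a hard-margin-like SVM as a separability check is a heuristic, and the point configurations are illustrative assumptions:

from itertools import product
import numpy as np
from sklearn.svm import LinearSVC

def is_shattered(points):
    # Try every 0/1 labeling of the points and check linear separability.
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue                              # single-class labelings are trivially realizable
        clf = LinearSVC(C=1e6, max_iter=100_000)  # large C approximates a hard margin
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:       # some labeling could not be realized
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])             # three non-collinear points
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])      # four points (XOR labeling fails)
print(is_shattered(three))   # True: lines in 2-D shatter 3 points, so VC dimension >= 3
print(is_shattered(four))    # False: no line realizes the XOR labeling of these 4 points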


VC Dimension


No free lunch theorem

• All models are wrong, but some models are useful. —George Box

• Much of machine learning is concerned with devising different models to fit the data.

• We can use methods such as cross validation to empirically choose the best model for a particular problem.

• However, there is no universally best model — this is sometimes called the no free lunch theorem (Wolpert 1996).


• The no free lunch rule illustrated with datasets: (a) is the training set we have, and (b) and (c) are two test sets. (c) has a different sample distribution from (a) and (b), so we cannot expect the properties learned from (a) to be useful on (c). (Figure omitted.)


The End
