Machine Learning
Dr. G. Bharadwaja Kumar
VIT Chennai
Important developments in Computer Science
1. Bio Metrics: Speaker verification, Face, iris, finger print
2. Finance: Credit scoring, fraud detection
3. Manufacturing: Optimization, troubleshooting
4. Medicine: Clinical diagnosis
5. Telecommunications: Quality of service optimization
6. Stock market forecasting
7. Hand written character recognition
8. Autonomous robot control
9. Spam email detection
10. ...
Dr. G. Bharadwaja Kumar, VIT Chennai 2
Watson - Jeopardy
Speech Recognition
Apple Smart Watch with SIRI; Apple Smart Phone with SIRI
Medical Diagnosis
• Assist in decision making with a large number of inputs and in stressful situations
Medical Diagnosis
ML Application: Loan Approvals
Name         income    debt     married  age   approve/deny
John Smith   200,000   0        yes      80
Peter White  60,000    1,000    no       30
Ann Clark    100,000   10,000   yes      40
Susan Ho     0         20,000   no       25
• Objects – people
• Classes – “approve”, “deny”
Biometrics
What is Artificial Intelligence?
• A branch of computer science which:
  • is the science of making machines do things that would require intelligence if done by men (Minsky)
  • is the exciting new effort to make computers think (Haugeland)
  • is the study of the computations that make it possible to perceive, reason, and act (Winston)
  • is the study of how to do things which at the moment people do better (Rich & Knight)
• The term “Artificial Intelligence” was coined in 1956 by John McCarthy at the Dartmouth conference
• AI is an extensive field of Computer Science
• There are many sub-fields of AI:
  • Machine Learning
  • Natural Language Processing
  • Speech Recognition
  • Computer Vision
Machine Learning : Definition
• “The goal of machine learning is to build computer systems that can adapt and learn from example data and past experience, and optimize their performance.”
  – Tom Dietterich
Other Definitions
❖ Machine learning enables computers to learn and improve automatically using example data or past experience, and to handle new situations.
❖ The field of Machine Learning seeks to answer the question: “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”
What is Machine Learning?
The complexity in traditional computer programming is in the code (programs that people write). In machine learning, algorithms (programs) are in principle simple and the complexity (structure) is in the data. Is there a way that we can automatically learn that structure? That is what is at the heart of machine learning.
-- Andrew Ng
That is, machine learning is about the construction and study of systems that can learn from data. This is very different from traditional computer programming.
What is learning?
• Abstracting and generalizing knowledge or patterns from the data
• Required components:
  • Identifying the exact type of knowledge to be learned
  • A representation for this target knowledge
  • A learning mechanism
Why Machine Learning is Hard
What you see vs. what your ML algorithm sees
Well Posed Learning Problems
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
  • T: the class of tasks that we want the computer program to do
  • P: a measure of performance for how well the computer did
  • E: some experience (training data) the program has with the task
A checkers learning problem
• A checkers learning problem:
  • T: playing checkers
  • P: percent of games won against opponents
  • E: playing practice games against itself
Designing a Learning System
• Consider designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament
• Requires the following design choices:
• Choosing the Training Experience
• Choosing the Target Function
• Choosing the Representation of the Target Function
• Choosing the Function Approximation Algorithm
CS 484 – Artificial Intelligence 27
Choosing the Training Experience (1)
• Will the training experience provide direct or indirect feedback?
  • Direct feedback: the system learns from examples of individual checkers board states and the correct move for each
  • Indirect feedback: move sequences and final outcomes of various games played
    • Credit assignment problem: the value of early states must be inferred from the outcome
• Degree to which the learner controls the sequence of training examples
  • The teacher selects informative boards and gives the correct move
  • The learner proposes board states that it finds particularly confusing; the teacher provides the correct moves
  • The learner controls the board states and (indirect) training classifications
Choosing the Training Experience (2)
• How well does the training experience represent the distribution of examples over which the final system performance P will be measured?
  • If training the checkers program consists only of games played against itself, it may never encounter crucial board states that are likely to be played by the human checkers champion
• Most theory of machine learning rests on the assumption that the distribution of training examples is identical to the distribution of test examples
Partial Design of Checkers Learning Program
• A checkers learning problem:
  • Task T: playing checkers
  • Performance measure P: percent of games won in the world tournament
  • Training experience E: games played against itself
• Remaining choices:
  • The exact type of knowledge to be learned
  • A representation for this target knowledge
  • A learning mechanism
Choosing the Target Function (1)
• Assume that you can determine the legal moves
• The program needs to learn the best move from among the legal moves
  • Defines a large search space known a priori
  • Target function: ChooseMove : B → M
  • ChooseMove is difficult to learn given indirect training
• Alternative target function
  • An evaluation function that assigns a numerical score to any given board state
  • V : B → ℝ (where ℝ is the set of real numbers)
• V(b) for an arbitrary board state b in B:
  • if b is a final board state that is won, then V(b) = 100
  • if b is a final board state that is lost, then V(b) = −100
  • if b is a final board state that is drawn, then V(b) = 0
  • if b is not a final state, then V(b) = V(b′), where b′ is the best final board state that can be achieved starting from b and playing optimally until the end of the game
Choosing the Target Function (2)
• V(b) gives a recursive definition for board state b
  • Not usable because it is not efficient to compute except in the first three trivial cases
  • A nonoperational definition
• The goal of learning is to discover an operational description of V
• Learning the target function is often called function approximation
  • The learned function is referred to as V̂
Choosing a Representation for the Target Function
• The choice of representation involves trade-offs
• Pick a very expressive representation to allow a close approximation to the ideal target function V
  • The more expressive the representation, the more training data is required to choose among alternative hypotheses
• Use a linear combination of the following board features:
  • x1: the number of black pieces on the board
  • x2: the number of red pieces on the board
  • x3: the number of black kings on the board
  • x4: the number of red kings on the board
  • x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
  • x6: the number of red pieces threatened by black

V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
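This linear representation is just a weighted sum, which can be sketched in a few lines of Python; the weights below are made-up values purely to show the arithmetic (in practice they are learned):

```python
# Linear evaluation function V̂(b) = w0 + w1*x1 + ... + w6*x6.
# The weights here are hypothetical; in practice they are learned.

def v_hat(weights, features):
    """weights[0] is the bias w0; features holds x1..x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

w = [0.0, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]   # w0..w6 (illustrative)
b = [3, 0, 1, 0, 0, 0]                       # x1..x6: 3 black pieces, 1 black king
print(v_hat(w, b))                           # 5.0
```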
Partial Design of Checkers Learning Program
• A checkers learning problem:
  • Task T: playing checkers
  • Performance measure P: percent of games won in the world tournament
  • Training experience E: games played against itself
  • Target function: V : Board → ℝ
  • Target function representation:

    V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
Choosing a Function Approximation Algorithm
• To learn V̂ we require a set of training examples, each describing a board state b and its training value Vtrain(b)
  • Ordered pair: ⟨b, Vtrain(b)⟩
  • Example: ⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
Estimating Training Values
• Need to assign specific scores to intermediate board states
• Approximate intermediate board state b using the learner's current approximation of the next board state following b
• Simple and successful approach
• More accurate for states closer to end states
Vtrain(b) ← V̂(Successor(b))
Adjusting the Weights
• Choose the weights wi to best fit the set of training examples
• Minimize the squared error E between the train values and the values predicted by the hypothesis
• Require an algorithm that• will incrementally refine weights as new training examples become
available
• will be robust to errors in these estimated training values
• Least Mean Squares (LMS) is one such algorithm
E ≡ Σ over training examples ⟨b, Vtrain(b)⟩ of (Vtrain(b) − V̂(b))²
LMS Weight Update Rule
• For each training example ⟨b, Vtrain(b)⟩:
  • Use the current weights to calculate V̂(b)
  • For each weight wi, update it as

    wi ← wi + η (Vtrain(b) − V̂(b)) xi

  • where η is a small constant (e.g., 0.1)
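One LMS update step can be sketched as follows; this is a minimal version assuming the bias w0 is updated with a constant feature x0 = 1, and the starting weights and board features are illustrative:

```python
# One LMS weight update: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i

def lms_update(weights, features, v_train, eta=0.1):
    # Current prediction V_hat(b); weights[0] is the bias w0 (with x0 = 1)
    v_hat = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    error = v_train - v_hat
    new_weights = [weights[0] + eta * error]                           # bias term
    new_weights += [w + eta * error * x
                    for w, x in zip(weights[1:], features)]            # w1..w6
    return new_weights

w = [0.0] * 7                 # start with all weights zero
b = [3, 0, 1, 0, 0, 0]        # board features x1..x6
w = lms_update(w, b, v_train=100.0)
print(w)                      # weights for active features move toward the target
```

Repeating this update over many training examples gradually reduces the squared error E defined above.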
Final Design
The final design consists of four interacting modules:
• Experiment Generator: takes the current hypothesis V̂ and outputs a new problem (an initial game board)
• Performance System: plays the game and produces a solution trace (the game history)
• Critic: turns the solution trace into training examples ⟨b1, Vtrain(b1)⟩, ⟨b2, Vtrain(b2)⟩, …
• Generalizer: produces a new hypothesis V̂ from the training examples
Summary of Design Choices
• Determine the type of training experience: games against itself, a table of correct moves, games against experts, …
• Determine the target function: Board → value, Board → move, …
• Determine the representation of the learned function: linear function of six features, polynomial, artificial neural network, …
• Determine the learning algorithm: gradient descent, linear programming, …
• Result: the complete design
Training Classification Problems
• Many learning problems involve classifying inputs into a discrete set of possible categories.
• Learning is only possible if there is a relationship between the data and the classifications.
• Training involves providing the system with data which has been manually classified.
• Learning systems use the training data to learn to classify unseen data.
Rote learning
• A very simple learning method.
• Simply involves memorizing the classifications of the training data.
• Can only classify previously seen data – unseen data cannot be classified by a rote learner.
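A rote learner can be sketched as a plain lookup table; this toy class is purely illustrative:

```python
# A rote learner memorizes training classifications verbatim.
# It can only answer for inputs it has already seen.

class RoteLearner:
    def __init__(self):
        self.memory = {}

    def train(self, example, label):
        self.memory[example] = label      # memorize; no generalization

    def classify(self, example):
        # Unseen data cannot be classified by a rote learner
        return self.memory.get(example, "unknown")

learner = RoteLearner()
learner.train(("red", "rectangle"), "positive")
print(learner.classify(("red", "rectangle")))   # positive
print(learner.classify(("blue", "circle")))     # unknown
```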
• Rote learning is learning without understanding the meaning of what is learned.
• For example, you can learn to make the correct response to a stimulus without discovering the conceptual category to which the stimulus belongs.
• More technically, you make the correct response without detecting the attributes that the stimulus shares with other members of the conceptual class.
• The next time you see that example, you may give the correct response.
• But what if you are given an example you haven’t seen before?
Concept Learning
• Concept learning involves determining a mapping from a set of input variables to a Boolean value.
• Such methods are known as inductive learning methods.
• If a function can be found which maps training data to correct classifications, then it will also work well for unseen data – hopefully!
• This process is known as generalization.
Concept Learning
• Concepts are categories of stimuli that have certain features in common.
• The shapes on the right are all members of a conceptual category: rectangle. Their common features are (1) 4 lines; (2) opposite lines parallel; (3) lines connected at ends; (4) lines form 4 right angles.
• The fact that they are different colors and sizes and have different orientations is irrelevant. Color, size, and orientation are not defining features of the concept.
• If a stimulus is a member of a specified conceptual category, it is referred to as a “positive instance”. If it is not a member, it is referred to as a “negative instance”. These are all negative instances of the rectangle concept.
• As rectangles are defined, a stimulus is a negative instance if it lacks any one of the specified features.
• Every concept has two components:
  • Attributes: the features of a stimulus that one must look for to decide if that stimulus is a positive instance of the concept.
  • A rule: a statement that specifies which attributes must be present or absent for a stimulus to qualify as a positive instance of the concept.
• For rectangles, the attributes would be the four features discussed earlier, and the rule would be that all the attributes must be present.
• The simplest rules refer to the presence or absence of a single attribute. For example, a “vertebrate” animal is defined as an animal with a backbone. Which of these stimuli are positive instances?
• This rule is called affirmation. It says that a stimulus must possess a single specified attribute to qualify as a positive instance of a concept.
• The opposite or “complement” of affirmation is negation. To qualify as a positive instance, a stimulus must lack a single specified attribute.
• An invertebrate animal is one that lacks a backbone. These are the positive and negative instances when the negation rule is applied.
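The affirmation and negation rules can be sketched as predicates over a stimulus's attribute set; the animals and attribute names below are illustrative:

```python
# Affirmation: positive instance iff a single specified attribute is present.
# Negation:    positive instance iff that attribute is absent.

def affirmation(stimulus, attribute):
    return attribute in stimulus

def negation(stimulus, attribute):
    return attribute not in stimulus

fish = {"backbone", "fins"}
worm = {"segments"}
print(affirmation(fish, "backbone"))   # True: the fish is a vertebrate
print(negation(worm, "backbone"))      # True: the worm is an invertebrate
```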
Concept Learning
• In behavioral terms, when a concept is learned, two processes control how we respond to a stimulus:
• Generalization: We generalize a certain response (like the name of an object) to all members of the conceptual class based on their common attributes.
• Discrimination: We discriminate between stimuli which belong to the conceptual class and those that don’t because they lack one or more of the defining attributes.
Concept Learning: Behavioral Processes
For example, we generalize the word “rectangle” to those stimuli that possess the defining attributes...
...and discriminate between these stimuli and others that are outside the conceptual class, in which case we respond with a different word.
Perspectives and Issues
• Data is cheap and abundant; knowledge is expensive and scarce.
• Build a model that is a good and useful approximation to the data.
When are ML algorithms NOT needed?
❖When the relationships between all system variables (input, output, and hidden) are completely understood!
❖This is NOT the case for almost any real system!
What is needed?
• When solving a machine learning problem we must be sure to identify:
• What task is to be learned?
• How do we (will we) test the performance of our system?
• What knowledge do we want to learn?
• How do we represent this knowledge?
• What learning paradigm would be best to use?
• How do we construct a training experience for our learner?
Why Machine Learning
• Human expertise does not exist (navigating on Mars),
• Humans are unable to explain their expertise (speech recognition)
• Solution changes in time (routing on a computer network)
• Solution needs to be adapted to particular cases (user biometrics)
• Needs to identify hidden relationships and correlations within large amounts of data
• Human designers often produce machines that do not work as well as desired in the environments in which they are used.
❖ The amount of knowledge available about certain tasks might be too large for explicit encoding by humans (e.g., medical diagnostic).
❖ New knowledge about tasks is constantly being discovered by humans. It may be difficult to continuously re-design systems “by hand”.
Advantages of ML
➢ Alleviates the knowledge acquisition bottleneck
  • Does not require knowledge engineers
  • Scalable in constructing the knowledge base
➢ Adaptive
  • Adaptive to changing conditions
  • Easy migration to new domains
  • Customizes itself to individual users
➢ Discover new knowledge from large databases (data mining)
  • Market basket analysis (e.g., diapers and beer)
• Medical text mining (e.g. migraines to calcium channel blockers to magnesium)
➢ To Engineer better Computing Systems
➢Build a model that is a good and useful approximation to the data.
➢Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
Issues with Machine Learning
❖What algorithms are available for learning a concept? How well do they perform?
❖How much training data is sufficient to learn a concept with high confidence?
❖When is it useful to use prior knowledge?
❖Are some training examples more useful than others?
❖What are best tasks for a system to learn?
❖What is the best way for a system to represent its knowledge?
❖How can we optimize the accuracy on future data points?
❖Are some learning problems computationally tractable?
❖How can we formulate application problems as machine learning paradigms?
Paradigms of Machine Learning Algorithms
❖Learning algorithms fall into various paradigms with respect to the sort of feedback that the learner has access to:
  ✓Supervised Learning
  ✓Unsupervised Learning
  ✓Semi-Supervised Learning
  ✓Reinforcement Learning
Supervised
❖For every input, the learner is provided with a target; that is, the environment tells the learner what its response should be.
❖The learner then compares its actual response to the target and adjusts its internal memory in such a way that it is more likely to produce the appropriate response the next time it receives the same input.
❖We can think of learning a simple categorization task as supervised learning.
• Supervised learning: (a) presents a three-class labeled dataset, where the color shows the corresponding label of each sample. After supervised learning, the class-separating boundary could be found as the dotted lines in (b).
Classification Example
Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

A classifier (model) is learned from the training set and then used to predict the Cheat label for the test set.
• Support Vector Machines
• Logistic regression
• Linear discriminant analysis
• Decision trees
• k-nearest neighbor algorithm
• Neural networks (multilayer perceptron)
• Naive Bayes, etc.
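As a concrete toy example of one of the methods above, a 1-nearest-neighbor classifier can be sketched in a few lines; the feature vectors below are made up in the spirit of the Cheat table, not taken from real data:

```python
# 1-nearest-neighbor: predict the label of the closest training point.

def nn_classify(train, query):
    # train: list of (feature_vector, label) pairs
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda ex: sq_dist(ex[0], query))[1]

# Toy features: (taxable income in K, refund flag); labels: Cheat yes/no
train = [((125, 1), "No"), ((95, 0), "Yes"), ((60, 0), "No")]
print(nn_classify(train, (90, 0)))   # closest training point is (95, 0) -> "Yes"
```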
Unsupervised
❖The learner receives no feedback from the world at all.
❖Instead the learner's task is to re-represent the inputs in a more efficient way, as clusters or categories, or using a reduced set of dimensions.
❖Unsupervised learning is based on the similarities and differences among the input patterns. It does not result directly in differences in overt behavior because its "outputs" are really internal representations.
Clustering
• The goal of clustering is to:
  • group data points that are close (or similar) to each other
  • identify such groupings (or clusters) in an unsupervised manner, i.e., no information is provided to the algorithm on which data points belong to which clusters
Examples of Clustering
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
  • Tailor-made for each person: too expensive
  • One-size-fits-all: does not fit all
• Example 2: In marketing, segment customers according to their similarities, to do targeted marketing.
• Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.
• Spatial Data Analysis
• E. g., land use, city planning, earth-quake studies
• Neural network models (self-organizing map (SOM) and adaptive resonance theory (ART))
• Clustering (e.g., k-means, Gaussian mixture models, k-mode)
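The k-means idea can be sketched in one dimension, using made-up heights for the T-shirt sizing example: alternately assign each point to its nearest center and move each center to the mean of its cluster.

```python
# Minimal 1-D k-means: assign points to nearest center, recompute means.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

heights = [150, 152, 155, 180, 182, 185]   # two natural size groups
print(kmeans_1d(heights, [150, 190]))      # centers settle near the group means
```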
❖Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).
❖It makes use of both labeled and unlabeled data for training: typically a small amount of labeled data with a large amount of unlabeled data.
• Semi-supervised learning: (a) presents a labeled dataset (with red, green, and blue) together with an unlabeled dataset (marked with black). The distribution of the unlabeled dataset can guide the position of the separating boundary.
• Semi-Supervised Support Vector Machines (S3VMs)
• Laplacian Regularized Least Squares (LapRLS)
• Semi-Supervised Random Forests
Reinforcement learning
❖The learner receives feedback about the appropriateness of its response.
❖For correct responses it resembles supervised learning; however, the two forms of learning differ significantly for errors, situations in which the learner's behavior is in some way inappropriate.
❖In these situations, supervised learning lets the learner know exactly what it should have done, whereas reinforcement learning only says that the behavior was inappropriate and (usually) how inappropriate it was.
• Q-Learning
• Temporal Difference Learning
• Prioritized Sweeping
• Dynamic Bayesian Network-Markov Decision Process
Classification of ML Algorithms
❖Based on how data is available to learning algorithms, or the way learning happens:
  ✓Batch Learning
  ✓Online Learning
  ✓Instance-Based Learning
  ✓Incremental Learning
  ✓Deep Learning
  ✓Evolutionary Learning
  ✓Sequence Learning
Batch Learning or Offline Learning
• Machine learning algorithms assume they have access to the entire training dataset at once.
• In general, most machine learning algorithms fall into this category.
• SVM, Neural Networks (MLP) etc.
Online Learning
• Data arrives in a sequential fashion at a very high rate (i.e., as data streams), and it is not possible to store all of it, which forces real-time analysis
• Slightly different characteristics than time series data
• Examples of data streams include computer network traffic, web searches, and sensor data.
• VERY FAST DECISION TREE (VFDT)
• Concept-Adapting Very Fast Decision Tree (CVFDT)
• BIRCH
• STREAM
• CluStream
Sequence Learning
• Most machine learning algorithms are designed for independent, identically distributed (i.i.d.) data
• Sequence learning is the study of machine learning algorithms designed for sequential data. These algorithms should:
  • not assume data points to be independent, i.e., the data instances are strongly correlated
  • be able to deal with sequential distortions
  • make use of context information
• Applications include speech recognition, gesture recognition, protein secondary structure prediction, handwriting recognition.
• Algorithms: Hidden Markov Models (HMM), Conditional Random Fields (CRF), Maximum-Entropy Markov Models (MEMM)
Instance-based or Memory-based learning
• Instead of performing explicit generalization, these methods compare new problem instances with instances seen in training, which have been stored in memory.
• K-nearest neighbor, Neural Network (RBF), Locally Weighted Regression
Incremental Learning
• Capable of learning and updating with every new data point, labeled or unlabeled.
• Incremental SVM
• Incremental HMM
• Incremental Sigmoid Belief Networks (ISBNs)
• Incremental Decision Trees (ID5R)
Deep Learning
• Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.
• Deep Boltzmann Machine, Convolutional Deep Neural Networks, Deep Belief Networks
Displaying the structure of a set of documents using Latent Semantic Analysis (a form of PCA)
Each document is converted to a vector of word counts. This vector is then mapped to two coordinates and displayed as a colored dot. The colors represent the hand-labeled classes.
When the documents are laid out in 2-D, the classes are not used. So we can judge how good the algorithm is by seeing if the classes are separated.
Displaying the structure of a set of documents using a deep neural network
Regularization
• An extension made to the basic learning method that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.
• The most popular regularization algorithms are:
• Ridge Regression
• Least Absolute Shrinkage and Selection Operator (LASSO)
• Elastic Net
• Least-Angle Regression (LARS)
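The effect of a complexity penalty can be sketched with one-dimensional ridge regression (a toy special case, not a full implementation): ordinary least squares minimizes Σ(y − wx)², ridge adds a penalty λw², and the closed form becomes w = Σxy / (Σx² + λ), so a larger λ shrinks the weight toward zero.

```python
# 1-D ridge regression: w = sum(x*y) / (sum(x^2) + lam).
# lam = 0 recovers ordinary least squares; larger lam shrinks w toward 0.

def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
print(ridge_1d(xs, ys, 0.0))    # 1.0  (no penalty: fits y = x exactly)
print(ridge_1d(xs, ys, 14.0))   # 0.5  (penalty halves the weight here)
```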
Ensemble Algorithms
• Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.
• Boosting
• Bootstrapped Aggregation (Bagging)
• AdaBoost
• Stacked Generalization (blending)
• Gradient Boosting Machines (GBM)
• Gradient Boosted Regression Trees (GBRT)
• Random Forest
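The core idea (independently trained weak models whose predictions are combined) can be sketched as a majority vote; the three weak "classifiers" here are hand-made rules, purely for illustration:

```python
from collections import Counter

# Majority-vote ensemble: each weak classifier votes; most common label wins.

def majority_vote(classifiers, x):
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

weak = [
    lambda text: "spam" if "win" in text else "ham",
    lambda text: "spam" if "free" in text else "ham",
    lambda text: "spam" if text.count("!") > 2 else "ham",
]
print(majority_vote(weak, "win a free cruise"))   # two of three vote "spam"
```

Real ensemble methods differ in how the weak models are trained (e.g., bagging resamples the data, boosting reweights it) and how votes are weighted.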
Dimensionality Reduction Algorithms
• This can be useful to reduce the number of features, either to visualize high-dimensional data or to simplify the learning process
• Principal Component Analysis (PCA)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• Sammon Mapping
• Multidimensional Scaling (MDS)
• Projection Pursuit
• Linear Discriminant Analysis (LDA)
• Mixture Discriminant Analysis (MDA)
• Quadratic Discriminant Analysis (QDA)
• Flexible Discriminant Analysis (FDA)
Linear & Non-Linear Separability
Linear Separability
• Let X0 and X1 be two sets of points in an n-dimensional Euclidean space. Then X0 and X1 are linearly separable if there exist n + 1 real numbers w1, w2, …, wn, k such that every point x in X0 satisfies

  Σ_{i=1}^{n} w_i x_i > k

and every point x in X1 satisfies

  Σ_{i=1}^{n} w_i x_i < k,

where x_i is the ith component of x.
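The definition can be checked directly in code for a toy point set; the weights and threshold below are chosen by hand for illustration:

```python
# Check the linear-separability definition: every point in X0 must score
# above k, every point in X1 below k, for the given weights w.

def separates(w, k, X0, X1):
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    return all(score(x) > k for x in X0) and all(score(x) < k for x in X1)

X0 = [(2, 2), (3, 3)]     # e.g. the "blue" points
X1 = [(0, 0), (1, 0)]     # e.g. the "red" points
print(separates((1, 1), 1.5, X0, X1))   # True: the line x + y = 1.5 separates them
```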
• Linearly separable data: if all the data points can be correctly classified by a linear decision boundary (line).
These two sets are linearly separable because there exists a line in the plane with all of the blue points on one side of the line and all the red points on the other side. This idea immediately generalizes to higher-dimensional Euclidean spaces if the line is replaced by a hyperplane.
• If the data is not linearly separable:
  ✓Allow some errors
  ✓Still, try to place the hyperplane “far” from each class
Non-Linear Separability
Non Linear problem
Linear Separability
• Linear or non-linear separable data?
  • We can find out only empirically
• Linear algorithms (algorithms that find a linear decision boundary)
  • Used when we think the data is linearly separable
  • Advantages: simpler, fewer parameters
  • Disadvantages: high-dimensional data is usually not linearly separable
  • Examples: Perceptron, SVM
Nonlinear Separability
• It is well known that a nonlinear mapping from a low-dimensional space into a high-dimensional space can make linear classification feasible.
• In (a) a two-dimensional input space is depicted, in which the yellow spheres and the red stars cannot be separated with a single straight line.
• With a nonlinear mapping into a three-dimensional space, as depicted in (b), the spheres and stars can be separated by a single linear hyperplane.
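One common nonlinear map from 2-D to 3-D is φ(x, y) = (x², y², √2·xy). This sketch shows that the squared radius, which separates points inside a circle from those outside it, becomes a linear quantity in the mapped space:

```python
import math

# Feature map phi: 2-D point -> 3-D point (x^2, y^2, sqrt(2)*x*y).
# Inside/outside a circle (nonlinear in 2-D) becomes a plane z1 + z2 = r^2
# in the mapped 3-D space, i.e., linearly separable.

def phi(x, y):
    return (x * x, y * y, math.sqrt(2) * x * y)

inside = phi(0.5, 0.5)    # a point inside the unit circle
outside = phi(2.0, 0.0)   # a point outside the unit circle
# In 3-D, z1 + z2 recovers the squared radius x^2 + y^2:
print(inside[0] + inside[1], outside[0] + outside[1])   # 0.5 4.0
```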
Radial Basis Function (RBF) kernel in LIBSVM
Multiclass or multinomial classification
•Given: some data items that belong to one of M mutually-exclusive classes
•Task: Train the classifier and predict the class for a new data item
•Geometrically: harder problem, no more simple geometry
Multi-class classification
• For example, classifying a set of images of fruits which may be oranges, apples, or pears.
• Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
• The first category of algorithms includes decision trees, neural networks, k-Nearest Neighbor, and Naive Bayes classifiers.
• The basic SVM supports only binary classification, but extensions have been proposed to handle the multiclass classification case as well.
• The multiclass classification problem can be decomposed into several binary classification tasks that can be solved efficiently using binary classifiers.
• The most successful and widely used binary classifiers are the Support Vector Machines. The idea is similar to that of using codewords for each class and then using a number of binary classifiers to solve several binary classification problems, whose results determine the class label for new data.
One-versus-all (OVA)
• The simplest approach is to reduce the problem of classifying among K classes into K binary problems, where each problem discriminates a given class from the other K-1 classes
• When testing an unknown example, the classifier producing the maximum output is considered the winner, and its class label is assigned to that example
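One-versus-all can be sketched with K hand-made scoring functions; the linear scorers below are hypothetical, purely to show the argmax step:

```python
# One-versus-all: one scorer per class; the class with the maximum output wins.

def ova_predict(scorers, x):
    # scorers: dict mapping class label -> scoring function
    return max(scorers, key=lambda label: scorers[label](x))

scorers = {
    "apple":  lambda x: x[0] - x[1],
    "orange": lambda x: x[1] - x[0],
    "pear":   lambda x: -abs(x[0] - x[1]),
}
print(ova_predict(scorers, (3.0, 1.0)))   # "apple" scores 2.0, the maximum
```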
All-versus-all (AVA)
• In this approach, each class is compared to each other class. A binary classifier is built to discriminate between each pair of classes, while discarding the rest of the classes. This requires building K(K−1)/2 binary classifiers.
• When testing a new example, a voting is performed among the classifiers and the class with the maximum number of votes wins
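An AVA sketch under the same scheme (illustrative assumptions: a toy nearest-centroid rule stands in for each pairwise binary classifier, and the data is invented):

```python
from itertools import combinations

def centroid(pts):
    return [sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0]))]

def ava_train(X, y, classes):
    # one binary classifier per pair of classes: K(K-1)/2 in total;
    # each is trained only on that pair's examples (toy centroid rule)
    models = {}
    for a, b in combinations(classes, 2):
        models[(a, b)] = (centroid([x for x, l in zip(X, y) if l == a]),
                          centroid([x for x, l in zip(X, y) if l == b]))
    return models

def ava_predict(models, x, classes):
    votes = dict.fromkeys(classes, 0)
    d2 = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    for (a, b), (ca, cb) in models.items():
        votes[a if d2(x, ca) <= d2(x, cb) else b] += 1
    return max(votes, key=votes.get)  # class with the most pairwise wins

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [0, 6], [1, 6], [0, 5]]
y = ['a'] * 3 + ['b'] * 3 + ['c'] * 3
m = ava_train(X, y, ['a', 'b', 'c'])
print(len(m), [ava_predict(m, p, ['a', 'b', 'c'])
               for p in [[0.5, 0.5], [5.5, 5.5], [0.5, 5.5]]])
```

For K = 3 this builds 3 pairwise models; each new example is voted on by all of them.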
Error-Correcting Output-Codes
• Output-code strategies are quite different from one-vs-the-rest and one-vs-one. Each class is represented as a point in a Euclidean space where each dimension can only be 0 or 1; in other words, each class is represented by a binary code (an array of 0s and 1s).
• The matrix that keeps track of the location/code of each class is called the code book, and the code size is the dimensionality of that space.
• Intuitively, each class should be represented by a code that is as unique as possible, and a good code book should be designed to optimize classification accuracy.
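A sketch of output-code decoding (the code book, the toy centroid bit-classifiers, and the data are all illustrative assumptions): one binary classifier is trained per code bit, and a new example is assigned the class whose codeword is nearest in Hamming distance to the predicted bits.

```python
def centroid(pts):
    return [sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0]))]

def d2(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

# code book: each class gets a binary code; one binary classifier per bit
CODEBOOK = {'a': (0, 0, 1), 'b': (0, 1, 0), 'c': (1, 0, 0)}

def ecoc_train(X, y, codebook):
    # per bit, a toy centroid classifier: "classes whose bit is 1" vs. rest
    bits = len(next(iter(codebook.values())))
    models = []
    for i in range(bits):
        pos = [x for x, l in zip(X, y) if codebook[l][i] == 1]
        neg = [x for x, l in zip(X, y) if codebook[l][i] == 0]
        models.append((centroid(pos), centroid(neg)))
    return models

def ecoc_predict(models, x, codebook):
    pred = tuple(1 if d2(x, cp) <= d2(x, cn) else 0 for cp, cn in models)
    ham = lambda u, v: sum(a != b for a, b in zip(u, v))
    return min(codebook, key=lambda c: ham(codebook[c], pred))  # nearest code

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [0, 6], [1, 6], [0, 5]]
y = ['a'] * 3 + ['b'] * 3 + ['c'] * 3
models = ecoc_train(X, y, CODEBOOK)
print([ecoc_predict(models, p, CODEBOOK)
       for p in [[0.5, 0.5], [5.5, 5.5], [0.5, 5.5]]])
```

With longer, well-separated codewords the Hamming decoding can correct some bit-classifier errors, which is where the "error-correcting" name comes from.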
Multi-Label Classification
•Given: some data items that belong to more than one class of M possible classes
• Task: Train the classifier and predict the class for a new data item
•Geometrically: harder problem, no more simple geometry
Multi-label classification: Examples
• Language identification
• Text categorization (topics)
• For instance, an article in a newspaper may be assigned to the categories POLITICS, SPORTS, RELIGION, etc.
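The newspaper example can also be handled by the simplest problem transformation, binary relevance: one independent yes/no decision per category, so an article may receive several labels. A sketch (the keyword rules below are invented stand-ins for trained binary classifiers):

```python
def predict_topics(article, keyword_rules):
    # binary relevance: one independent yes/no decision per topic;
    # an article can match zero, one, or several topics
    words = set(article.lower().split())
    return sorted(t for t, kws in keyword_rules.items() if words & kws)

rules = {'politics': {'election', 'minister'},
         'sports': {'match', 'cricket'},
         'religion': {'temple', 'festival'}}
print(predict_topics("The minister attended the cricket match", rules))
# → ['politics', 'sports']
```

Unlike multiclass classification, nothing forces the predicted label set to have exactly one element.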
• Some classification algorithms/models have been adapted to the multi-label task, without requiring problem transformations. Examples of these include:
• boosting: AdaBoost.MH and AdaBoost.MR are extended versions of AdaBoost for multi-label data.
• k-nearest neighbors: the ML-kNN algorithm extends the k-NN classifier to multi-label data.
• decision trees: "Clare" is an adapted C4.5 algorithm for multi-label classification; the modification involves the entropy calculations.
• neural networks: BP-MLL is an adaptation of the popular back-propagation algorithm for multi-label learning.
• Various binary classifiers have been developed over time, and there is no clear winner as to which classifier performs best. Different classifiers perform differently depending on the number of observations, the dimensionality of the feature vector, the noise in the data, and various other factors. For example, random forests can outperform SVM classifiers on 3D point clouds.
General Assumptions on Dataset
•In machine learning, an unknown universal dataset is assumed to exist, which contains all the possible data pairs as well as their probability distribution of appearance in the real world.
• In real applications, however, what we observe is only a subset of the universal dataset, due to limited memory or other unavoidable reasons.
• This acquired dataset is called the training set (training data) and used to learn the properties and knowledge of the universal dataset.
• In general, the vectors in the training set are assumed to be independently and identically distributed (i.i.d.) samples from the universal dataset
Bias – Variance Tradeoff
• The bias-variance tradeoff is an important aspect of data science projects based on machine learning.
• Any learning algorithm uses a mathematical or statistical model whose “error” can be split into two main components: reducible and irreducible error.
• Irreducible error or inherent uncertainty is associated with a natural variability in a system. On the other hand, reducible error, as the name suggests, can be and should be minimized further to maximize accuracy.
• Suppose our outcome variable is Y and our covariates are X. We may assume there is a relationship relating one to the other, such as Y = f(X) + ε, where the error term ε is normally distributed with mean zero: ε ∼ N(0, σ_ε)
We may estimate a model f̂(x) of f(X) using linear regression or another modeling technique. In this case, the expected squared prediction error at a point x decomposes as:

Err(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ_ε²
       = Bias² + Variance + Irreducible error

• The third term, the irreducible error, is the noise term in the true relationship that cannot fundamentally be reduced by any model. Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and variance terms to 0.
• However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.
• Reducible error can be further decomposed into “error due to bias” and “error due to variance.” The data scientist’s goal is to simultaneously reduce bias and variance as much as possible, in order to obtain as accurate a model as is feasible.
• However, there is a tradeoff to be made when selecting models of different flexibility or complexity and in selecting appropriate training sets to minimize these sources of error!
• The bias is error from erroneous assumptions in the learning algorithm, or from model mismatch. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
• The variance is error from sensitivity to the particular training sample and to randomization. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outputs.
• The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and also generalizes well to unseen data.
• Unfortunately, it is typically impossible to do both simultaneously.
• High-variance learning methods may be able to represent their training set well, but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit, but may underfit their training data, failing to capture important regularities.
Error due to Bias
• The error due to bias is taken as the difference between the expected (or average obtained from cross-validation) prediction of our model and the correct value which we are trying to predict.
• Bias measures how far off, in general, these models' predictions are from the correct value. If the average predictions are substantially different from the true value, bias will be high.
Error due to Variance
• The error due to variance is the amount by which the prediction made from one particular training set differs from the expected prediction averaged over all possible training sets.
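These two definitions can be made concrete with a small simulation (illustrative assumptions: a quadratic truth, Gaussian noise, and two extreme models, a constant "mean" predictor and a 1-nearest-neighbor predictor). Averaging predictions over many resampled training sets estimates bias² and variance at one query point:

```python
import random
import statistics

random.seed(0)
f = lambda x: x * x                                   # true function
def draw_training_set(n=20, sigma=0.5):
    xs = [random.uniform(0, 1) for _ in range(n)]
    return [(x, f(x) + random.gauss(0, sigma)) for x in xs]

x0 = 0.9                                              # query point
mean_preds, nn_preds = [], []
for _ in range(2000):                                 # many training sets
    D = draw_training_set()
    mean_preds.append(statistics.fmean(t for _, t in D))      # rigid model
    nn_preds.append(min(D, key=lambda p: abs(p[0] - x0))[1])  # flexible model

for name, preds in [("mean model", mean_preds), ("1-NN model", nn_preds)]:
    bias2 = (statistics.fmean(preds) - f(x0)) ** 2    # (avg prediction - truth)^2
    var = statistics.pvariance(preds)                 # spread across training sets
    print(f"{name}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

The rigid mean model shows large bias² and small variance; the 1-NN model shows the reverse, which is exactly the tradeoff described above.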
Graphical Visualization of bias and variance using a bulls-eye diagram
• If your target truth is nonlinear and you select a linear model to approximate it, you introduce a bias resulting from the linear model’s inability to be nonlinear where it needs to be: the linear model underfits the nonlinear target function over the training set. Likewise, if your target truth is linear and you select a nonlinear model to approximate it, you introduce a bias resulting from the nonlinear model’s inability to be linear where it needs to be: the nonlinear model overfits the linear target function over the training set.
bias vs. variance
• high bias, low variance
• medium bias, medium variance
• low bias, high variance
Bias-variance tradeoff
[Figure: training error and test error vs. model complexity. Training error decreases monotonically with complexity; test error is U-shaped. The underfitting regime (high bias, low variance) lies at low complexity, the overfitting regime (low bias, high variance) at high complexity.]
Bias-variance tradeoff
[Figure: test error vs. model complexity for many vs. few training examples. With few examples, the high-variance overfitting regime is reached at lower complexity; with many examples, higher-complexity (lower-bias) models become viable.]
Effect of Training Size
Effect of Model Complexity
Computational Learning Theory
Learning Theory
•COLT helps to define the class of learnable concepts in terms of (i) computational complexity, i.e. the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts, and (ii) sample complexity, i.e. the number of training examples necessary or sufficient to learn hypotheses of a given accuracy.
•A good hypothesis is a productive one. A productive hypothesis can:
• Be easily learned and applied
• Explain the past accurately and persuasively
• Make accurate predictions about the future
• Generate new, even more useful hypotheses
• Be applied to a wide variety of situations
• Be easily tested
• Learning in the limit: Is the learner guaranteed to converge to the correct hypothesis in the limit as the number of training examples increases indefinitely?
• Sample Complexity: Can one characterize the number of training examples necessary/sufficient for highly accurate learning?
• Computational Complexity: How much computational resources (time and space) are needed for a learner to learn a highly accurate hypothesis?
• Is it possible to identify classes of concepts that are inherently difficult/easy to learn, independent of the learning algorithm?
• Mistake Bound: how many training examples will the learner misclassify before converging on a highly accurate concept?
Two frameworks for analyzing learning algorithms
• Probably Approximately Correct (PAC) framework
  • Identify classes of hypotheses that can/cannot be learned from a polynomial number of training samples
    • Finite hypothesis spaces
    • Infinite hypothesis spaces (VC dimension)
  • Define a natural measure of complexity for hypothesis spaces (the VC dimension) that allows bounding the number of training examples required for inductive learning
• Mistake bound framework
  • Number of training errors made by a learner before it determines the correct hypothesis
PAC Learning
• PAC Model
  • Only requires learning a Probably Approximately Correct concept: learn a decent approximation most of the time.
  • Requires polynomial sample complexity and computational complexity.
• PAC-learnability is mostly determined by the number of training examples required by the learner. The sample complexity of a learning problem is the growth in the number of required training examples with problem size.
• This is because in practical settings the most limiting factor of the learner is the number of training examples available.
• Suppose we try to characterize the number of training examples needed to learn a hypothesis h for which error = 0. Unfortunately, it turns out this is futile in the setting we are considering, for two reasons.
• First, unless we provide training examples corresponding to every possible instance in X (an unrealistic assumption), there may be multiple hypotheses consistent with the provided training examples, and the learner cannot be certain to pick the one corresponding to the target concept.
• Second, given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by the learner will be misleading.
• To accommodate these two difficulties, we weaken our demands on the learner in two ways.
• First, we will not require that the learner output a zero-error hypothesis; we will require only that its error be bounded by some constant ε that can be made arbitrarily small. Second, we will not require that the learner succeed for every sequence of randomly drawn training examples; we will require only that its probability of failure be bounded by some constant δ that can be made arbitrarily small.
• The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept.
• In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) the system learns a concept with error at most ε.
Chernoff‐Hoeffding bound
• If you average a bunch of bounded random variables, then the probability this average random variable deviates from its expectation is exponentially small in the amount of deviation.
• Let X₁, X₂, ..., X_m be independent random variables whose values are in the range [0, 1]. Let

X = Σᵢ Xᵢ,   μᵢ = E[Xᵢ],   μ = E[X] = Σᵢ μᵢ

Then for all ε > 0,

Pr(|X − μ| > ε) ≤ 2e^(−2ε²/m)
• One nice thing about the Chernoff bound is that it doesn’t matter how the variables are distributed.
• This is important because in PAC we need guarantees that hold for any distribution generating data.
• Indeed, in this case the random variables above will be individual examples drawn from the distribution generating the data.
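A quick empirical check of the bound (a toy assumption: the Xᵢ are fair coin flips, so the sum X is Binomial): the observed frequency of large deviations should stay below the Chernoff-Hoeffding bound.

```python
import math
import random

random.seed(1)
m, trials = 200, 5000
eps = 20                                  # deviation of the SUM of m coin flips
bound = 2 * math.exp(-2 * eps**2 / m)     # Chernoff-Hoeffding bound

mu = m * 0.5                              # E[X] for m fair {0,1} coin flips
exceed = 0
for _ in range(trials):
    X = sum(random.random() < 0.5 for _ in range(m))
    if abs(X - mu) > eps:
        exceed += 1
print(f"empirical Pr = {exceed / trials:.4f}  <=  bound = {bound:.4f}")
```

The bound here is about 0.037, while the empirical deviation frequency is far smaller; the bound is loose, but it holds for any distribution on [0, 1], which is what PAC analysis needs.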
• We’ll be estimating the probability that our hypothesis has error deviating by more than ε, and we want to bound this probability by δ, as in the definition of PAC-learning.
• Since the amount of deviation (error ε) and the number of samples (m ) both occur in the exponent, the trick is in balancing the two values to get what we want.
• An algorithm that efficiently finds a consistent hypothesis will PAC-learn any finite concept class provided it has at least m samples, where

m ≥ (1/ε) (ln|H| + ln(1/δ))
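The bound is easy to evaluate numerically. A sketch (the hypothesis class in the example, conjunctions over 10 boolean features with |H| = 3¹⁰, is an assumed illustration):

```python
import math

def pac_sample_bound(H_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# hypothetical example: conjunctions over 10 boolean features, |H| = 3^10
print(pac_sample_bound(3 ** 10, eps=0.1, delta=0.05))   # → 140
```

Note the pleasant scaling: m grows only logarithmically in |H| and in 1/δ, but linearly in 1/ε.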
Inductive Learning Hypothesis
•Any hypothesis found to approximate the target function well over the training examples, will also approximate the target function well over the unobserved examples.
• find the hypothesis that best fits the training data
• Version space: contains all plausible versions of the target concept
• A hypothesis h is consistent with training examples D iff h(x) = c(x) for each example <x, c(x)> in D
• The version space with respect to hypothesis space H and training examples D is the subset of hypotheses from H consistent with the training examples in D
• The VC dimension (Vapnik–Chervonenkis dimension) is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.
The probabilistic guarantee

E_test ≤ E_train + √[ (h (log(2N/h) + 1) − log(p/4)) / N ]

where N = size of training set
h = VC dimension of the model class = complexity
p = upper bound on the probability that this bound fails

So if we train models with different complexity, we should pick the one that minimizes this bound.

Actually, this is only sensible if we think the bound is fairly tight, which it usually isn't. The theory provides insight, but in practice we still need some witchcraft.
A simple example: Fitting a polynomial
• The green curve is the true function (which is not a polynomial)
• The data points are uniform in x but have noise in y.
• We will use a loss function that measures the squared error in the prediction of y(x) from x. The loss for the red polynomial is the sum of the squared vertical errors.
from Bishop
Some fits to the data: which is best?
from Bishop
A simple way to reduce model complexity
• If we penalize polynomials that have big values for their coefficients, we will get less wiggly solutions:
Ẽ(w) = (1/2) Σₙ₌₁ᴺ { y(xₙ, w) − tₙ }² + (λ/2) ||w||²

where λ is the regularization parameter, tₙ is the target value, and Ẽ(w) is the penalized loss function.
from Bishop
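The penalized loss above can be evaluated directly. A sketch (the data points and the two hand-picked polynomials are illustrative assumptions): both polynomials fit the training points exactly, but the wiggly one pays a much larger coefficient penalty.

```python
def penalized_loss(w, data, lam):
    # E~(w) = 1/2 * sum_n { y(x_n, w) - t_n }^2  +  (lam/2) * ||w||^2
    y = lambda x: sum(wj * x ** j for j, wj in enumerate(w))
    data_term = 0.5 * sum((y(x) - t) ** 2 for x, t in data)
    return data_term + 0.5 * lam * sum(wj * wj for wj in w)

# two polynomials that BOTH pass exactly through (0,0), (0.5,0.5), (1,1)
data = [(0, 0), (0.5, 0.5), (1, 1)]
w_smooth = [0, 1]            # y = x
w_wiggly = [0, 2, -3, 2]     # y = 2x - 3x^2 + 2x^3, same values at the 3 points
print(penalized_loss(w_smooth, data, lam=1.0))   # 0.5 * ||w||^2 = 0.5
print(penalized_loss(w_wiggly, data, lam=1.0))   # 0.5 * 17   = 8.5
```

With λ = 0 the two fits are indistinguishable (both have zero training error); the penalty term is what breaks the tie in favor of the less wiggly solution.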
[Figure: regularized fits compared, with the table of polynomial coefficients for each setting of λ]
Generalization
• The real aim of Machine Learning is to do well on test data that is not known during learning.
Generalization
• Choosing the values for the parameters that minimize the loss function on the training data is not necessarily the best policy.
• We want the learning machine to model the true regularities in the data and to ignore the noise in the data.
  • But the learning machine does not know which regularities are real and which are accidental quirks of the particular set of training examples we happen to pick.
• So how can we be sure that the machine will generalize correctly to new data?
Generalization
• One can only say a model generalizes well if it explains the data surprisingly well given the complexity of the model.
• If the model has as many degrees of freedom as the data, it can fit the data perfectly but so what?
• There is a lot of theory about how to measure the model complexity and how to control it to optimize generalization.
Generative vs. Discriminative
Generative vs. Discriminative
• Generative models: specify a joint probability distribution over observation and label sequences, i.e. a full probabilistic model of all variables
  • Model class-conditional pdfs and prior probabilities
• Discriminative models: provide a model only for the target variable(s) conditional on the observed variables
  • Directly estimate posterior probabilities
  • No attempt to model the underlying probability distributions
Generative vs. Discriminative
[Figure: a discriminative model learns a decision boundary; a generative model learns the class-conditional densities]
Generative Methods
• ☺ Relatively straightforward to characterize invariances
• ☺ They can handle partially labelled data
• They wastefully model variability which is unimportant for classification
• They scale badly with the number of classes and the number of invariant transformations
• Slow on test data
• higher asymptotic error
Discriminative Methods
• ☺ They can be very fast once trained
• ☺ lower asymptotic error
• inherently supervised, cannot deal with unlabelled data
• They interpolate between training examples, and hence can fail if novel inputs are presented
• They don’t easily handle compositionality
Generative vs. Discriminative

Generative
• Naïve Bayes
• Mixtures of Gaussians
• Hidden Markov Models (HMM)
• Bayesian networks
• Markov random fields

Discriminative
• Logistic regression
• SVMs
• Neural networks (MLP, RBF)
• Nearest neighbor
• Conditional Random Fields (CRF)
Parametric vs non-parametric models
• Parametric models assume some finite set of parameters θ.
• Given the parameters, future predictions, x, are independent of the observed data, D:
P(x|θ, D) = P(x|θ)
• Therefore θ captures everything there is to know about the data.
• So the complexity of the model is bounded even if the amount of data is unbounded. This makes parametric models not very flexible.
• “Non-parametric” does not mean “no parameters”: rather, the number and nature of the parameters are determined by the training data, not fixed in advance by the model.
• Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite dimensional θ. Usually we think of θ as a function.
• The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes them more flexible.
• Monte Carlo Hidden Markov Models: non-parametric hidden Markov models with continuous state and observation spaces.
• Based on the Dirichlet Process, a nonparametric Bayesian Hidden Markov Model is proposed, which allows an infinite number of hidden states and uses an infinite number of Gaussian components to support continuous observations
Parameters to learn
• Two kinds of parameters
• One kind the user sets in advance for the training procedure: hyperparameters
  • the degree of the polynomial to fit in regression
  • the number/size of hidden layers in a neural network
  • the number of instances per leaf in a decision tree
• One kind that actually gets optimized during training: parameters
  • regression coefficients
  • network weights
  • size/depth of the decision tree
• We usually do not talk about the latter, but refer to hyperparameters as parameters
•Machine Learning algorithms rely on three components:
✓ Representation
✓ Optimization
✓ Evaluation
Model Selection
• The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. The chosen model is the one that minimizes the Kullback-Leibler distance between the model and the truth. It’s based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined as:
AIC = -2 ( ln ( likelihood )) + 2 K
where likelihood is the probability of the data given a model and K is the number of free parameters in the model.
• The second order information criterion, often called AICc, takes into account sample size by, essentially, increasing the relative penalty for model complexity with small data sets. It is defined as:
AICc = -2 ( ln ( likelihood )) + 2 K * (n / ( n – K – 1))
where n is the sample size. As n gets larger, AICc converges to AIC (n − K − 1 → n as n gets much bigger than K, and so n / (n − K − 1) approaches 1), so there's really no harm in always using AICc regardless of sample size.
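Both criteria are one-line formulas. A sketch using the definitions above (the two candidate models, their log-likelihoods, and n are invented illustration values):

```python
def aic(log_lik, k):
    # AIC = -2 ln(likelihood) + 2K
    return -2 * log_lik + 2 * k

def aicc(log_lik, k, n):
    # AICc = -2 ln(likelihood) + 2K * (n / (n - K - 1))
    return -2 * log_lik + 2 * k * (n / (n - k - 1))

# hypothetical fits: (log-likelihood, K free parameters), n = 30 observations
models = {"simple (K=3)": (-120.0, 3), "complex (K=8)": (-118.5, 8)}
n = 30
for name, (ll, k) in models.items():
    print(f"{name}: AIC = {aic(ll, k):.2f}, AICc = {aicc(ll, k, n):.2f}")
print("chosen:", min(models, key=lambda m: aicc(*models[m], n)))
```

Here the complex model's slightly better likelihood does not pay for its extra parameters at this small n, so the simple model is chosen.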
Minimum Description Length
• MDL is an information-theoretic approach to machine learning, or statistical model selection
• Basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself.
• Provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution.
• Hence you really want to minimize the combined length of the description of the model, plus the description of the data under that model.
• The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally
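A toy two-part-code sketch of this idea (all numbers are illustrative assumptions: the model costs are invented, and the data is a deliberately regular bit sequence): the total description length is the bits needed to state the model plus the bits needed to encode the data under that model.

```python
import math

def description_length(data, p, model_bits):
    # two-part code: bits to state the model + bits to encode the data
    # under a Bernoulli(p) model (Shannon code length: -log2 probability)
    n1 = sum(data)
    n0 = len(data) - n1
    data_bits = -(n1 * math.log2(p) + n0 * math.log2(1 - p))
    return model_bits + data_bits

bits = [1] * 18 + [0] * 2          # a regular (highly compressible) sequence
# "no regularity" model: p = 0.5, assume ~0 bits to state it
plain = description_length(bits, 0.5, model_bits=0)
# fitted model: p = 0.9, assume 8 bits to transmit the parameter
fitted = description_length(bits, 0.9, model_bits=8)
print(f"plain: {plain:.1f} bits, fitted: {fitted:.1f} bits")
```

The fitted model wins despite its own description cost, because the regularity in the data lets it encode the sequence in fewer bits; on truly random data the plain model would win.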
[Figure: a learner induces a classifier that labels new examples Positive (Yes) or Negative (No); exact concepts cannot be learned from limited data, only approximations.]
•A measure of the “power” or the “complexity” of the hypothesis space
• Higher VC dimension implies a more “expressive” hypothesis space
•Shattering: a set of N points is shattered if, for every possible classification of the N points, there exists a hypothesis consistent with that classification
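Shattering can be checked exhaustively for small hypothesis classes. A sketch (the class of 1-D thresholds, with both orientations, is an assumed illustration; its VC dimension is 2):

```python
from itertools import product

def realizable(points, labels):
    # hypothesis class: thresholds h(x) = [x > t], plus the flipped version;
    # only one candidate cut between each adjacent pair of points matters
    pts = sorted(points)
    cuts = ([pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
            + [pts[-1] + 1])
    for t in cuts:
        for flip in (False, True):
            if tuple((x > t) != flip for x in points) == labels:
                return True
    return False

def shatters(points):
    # shattered: every one of the 2^N labelings is realized by some hypothesis
    return all(realizable(points, labs)
               for labs in product((False, True), repeat=len(points)))

print(shatters([0.0, 1.0]))        # thresholds shatter any 2 distinct points
print(shatters([0.0, 1.0, 2.0]))   # but not 3: (+, -, +) is unrealizable
```

Since some set of 2 points is shattered but no set of 3 is, the VC dimension of this threshold class is exactly 2.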
VC Dimension
No free lunch theorem
• All models are wrong, but some models are useful. —George Box
• Much of machine learning is concerned with devising different models to fit the data.
• We can use methods such as cross validation to empirically choose the best model for a particular problem.
• However, there is no universally best model — this is sometimes called the no free lunch theorem (Wolpert 1996).
• The no free lunch idea, illustrated with datasets: (a) is the training set we have, and (b), (c) are two test sets. Since (c) has a different sample distribution from (a) and (b), we cannot expect the properties learned from (a) to be useful on (c).
The End