279
Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course

Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Introduction To Machine Learning

Portions of this course are from Machine Learning Crash Course

Page 2: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Goals of This Class

● Learn to take a real-life problem and apply machine learning to make predictions.

● Learn to implement machine learning solutions using TensorFlow

● Learn how to evaluate the quality of your solution● Machine learning is a very broad field -- we only just

touch upon some of the most common machine learning algorithms

Page 3: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Some Sample Applications of Machine Learning

Page 4: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample Applications of Machine Learning

● Medical applications such as disease prediction● Speech recognition and understanding● Recommendation systems● Malware and spam detection● Image understanding and annotation● AI for games● Translating between languages● Predicting likelihood of earthquakes● Matching resumes with jobs

Page 5: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Google Products Using Machine Learning

Page 6: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Google Assistant

Page 7: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Google Photos: Searching Images via Text

Page 8: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Gmail: Smart Reply

Page 9: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Google Play Music: Recommending Music

Page 10: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Game Playing: Alpha Go

Page 11: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Combined Vision and Translation

Page 12: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

What is Machine Learning?

Page 13: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

What is Machine Learning (ML)?

There are many ways to define ML.● ML systems learn how to combine data to

produce useful predictions on never before seen data

● ML algorithms find patterns in data and use these patterns to react correctly to brand new data.

Page 14: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample Machine Learning Problem

● Predict cricket chirps/min from the temperature in centigrade

● To the right is a plot showing cricket chirps per minute (y-axis) vs temperature (x-axis)

● What do you observe?

Page 15: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample Machine Learning Problem (cont)● The line shown in blue fits the

data well.● Recall that the equation for a

line is y = mx + b with slope m and y-intercept b.

Page 16: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample Machine Learning Problem (cont)● The line shown in blue fits the

data well.● Recall that the equation for a

line is y = mx + b with slope m and y-intercept b.

● In this example: y = 7x - 30

Page 17: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample Machine Learning Problem (cont)

17

89

● The line shown in blue fits the data well.

● Recall that the equation for a line is y = mx + b with slope m and y-intercept b.

● In this example: y = 7x - 30

● If we pick a temp not in the data, we can use this line to predict the chirps. If x = 17, then y = 7 * 17 - 30 = 89

Page 18: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample Machine Learning Problem (cont)● The line shown in blue fits the

data well.● Recall that the equation for a

line is y = mx + b with slope m and y-intercept b.

● In machine learning we often call this a model since it is what we use to make our predictions.

● We call this model a linear model since we fit the data with a line

Page 19: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Definition: Linear Regression● When a learning model fits data by a line, we say it is linear

● Regression is relationship between correlated variables

○ When you have a +/- change in a variable that affects the other variable +/-

● When you have just one input value (feature), linear regression is the problem if fitting a line to the data and using this line to make new predictions.

● By convention in machine learning we write: y’ = b + wx. So instead of m, we use w.

Page 20: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Linear Regression with Many Features● Often you have multiple features that can be used to predict a

value. ○ For example suppose you want to predict the rent for an

apartment. ○ Two relevant features are: the number of bedrooms and

the apt condition from 1 (poor) to 5 (excellent)

● In this case the input x becomes a pair (x1 , x2 ) where x1 is the number of bedrooms and x2 is the apt condition.

● By convention we make x bold since it is a vector.● Thus our equation to predict the rent becomes:

y’ = b + w1x1 + w2x2

Page 21: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Group Chat: Applying ML

Imagine the problem of building an ML model to predict apartment rental price.● What features would you use?● What would the training labels be?● Describe some unlabeled examples● Find a way to describe this as a regression task

Page 22: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Linear Regression NotationWhen you have n features the linear model becomes:

y’ = b + w1x1 + w2x2 + … + wnxn

● y’ is what we use as our predicted value for the label

● xk is a feature (one of the input values)

● wk is the weight (importance) associated with feature k

● b is the bias (y-intercept)

Sometimes w0 is used instead of b with a default that x0 = 1

The model parameters (which are to be learned) are: b, w1, w2 , …, wn

Page 23: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Loss ● We need a way to say what solution fits the data best● Loss is a penalty for an incorrect prediction● Lower loss is generally better ● Choice of loss function guides the model we choose

Do you think the green line or blue line will lead to better predictions?

Page 24: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Evaluating If a Linear Model Is Good ● We will eventually look at mathematical ways to evaluate the

quality of a model but that should not replace more informal yet intuitive ways to evaluate a model

● When you just have one variable, drawing a line on a scatter plot that shows the model predictions is a good tool.

● Let’s look at two different models to predict the price of a car from its engine’s horsepower.

Page 25: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Which Model is Better?

Page 26: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Calibration Plots To Visualize a Model● We want something that can serve a similar role to a scatter plot

when we have more than one input feature.

● For this we can use a calibration plot that shows the prediction (x-axis) versus the target value (y-axis)

● On the next slide we’ll look at two calibrations plots from different models trained on the same data set.

Page 27: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Which Model is Better?

Page 28: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Group Chat: Measuring Loss

What are some mathematical ways to measure loss? Remember that● Loss should become smaller when predictions improve.● Loss should become larger when prediction get worse.

Lots of ideas are reasonable here; be creative.

Page 29: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

A Convenient Loss Function for Regression

● L2 Loss for a given example is also called squared error= square of the difference between prediction and label= (observation - prediction)2

= (y - y’ )2

Page 30: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

For training set D composed of examples x = <x0 , x1 , x2 , …, xn> and correct label y, and the prediction for the current model w of y’= wTx = w0x0 + w1x1 + w2x2 + … + wnxn, the squared error (L2Loss) is:

Computing Squared Error on a Data Set

We’re summing over all examples in our training set D

Generally average over all examples, so divide by the number of examples in D (denoted by |D|)

Page 31: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

For training set D composed of examples x = <x0 , x1 , x2 , …, xn> and correct label y, and the prediction for the current model y’

RMSE - Root Mean Squared Error

Our goal is to train a model that minimizes RMSE (which is the same as minimizing the mean squared error).

Page 32: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Summary of Linear Regression● y’ = wx + b ● Minimize mean

squared error (or RMSE)

● In general have: y’ = b + w1x1 + … + wnxn

● Learn model weights b, w1, w2, … wn from data

w = 129.25b = 0.382

Page 33: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Model Weights

Page 34: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Gradient Descent: High Level View

● Derivative of (y - y’)2 w.r.t. the model weights w tells us how loss changes in the neighborhood of weights we're currently using.

● A gradient is the generalization of a derivative when you have more than one variable.

● We can take small steps in the negative gradient direction to modify the weights so that the loss on that example is lower.

● This optimization strategy is called Gradient Descent.

Page 35: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Pictorial View of Gradient Descent

value of weight

loss

● As a simple illustration assume our model has a single weight and we are using squared loss

● To the right we plot the squared loss in blue

● The yellow dot represents our current model weight

● The red dot is the model weight minimizing squared loss

loss function

Page 36: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Pictorial View of Gradient Descent

value of weight

loss

● As a simple illustration assume our model has a single weight and we are using squared loss

● To the right we plot the squared loss in blue

● The yellow dot represents our current model weight

● The red dot is the model weight minimizing squared loss

loss function

current model weight

Page 37: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Pictorial View of Gradient Descent

value of weight

loss

● As a simple illustration assume our model has a single weight and we are using squared loss

● To the right we plot the squared loss in blue

● The yellow dot represents our current model weight

● The red dot is the model weight minimizing squared loss

loss function

optimal model weight

current model weight

Page 38: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Training: A Gradient Step

starting point

value of weight w

loss

(negative) gradient

next point

Page 39: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate: Size of Step to Make

value of weight

loss

● The sequence of yellow dots shows how the model evolves with the weight updates

● Gradient descent with a good learning rate gives small but noticeable steps towards the optimal value

● We want to end with a model very close to the optimal model as seen here

Page 40: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Using TensorFlow (TF)

Later in the course we will present Gradient Descent in more depth. For now we just present enough detail for you to be apply to apply the TF API to obtain a good linear model for a simple real data set.

Page 41: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Using TensorFlow (TF)

import tensorflow as tf

# Define a linear regression model.estimator = tf.contrib.learn.LinearRegressor( optimizer=tf.GradientDescentOptimizer(learning_rate=0.001))

# Fit the model on training data -- minimizes squared lossestimator.fit(X_train, y_train, steps=10000)

# Use it to predict on new datapredictions = estimator.predict(X_new)

Page 42: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Training a Model with Gradient Descent

In the scatter plot we see the model evolve (from the blue to red line)

The Learning Curve shows how the loss changes over time

Page 43: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Things You Need to Decide● Learning Rate

○ Very important since this is the size of the step to take. Typically change by powers of 10 until the model is training reasonably well and then fine tune

● Number of Steps to Train

○ Time to train is proportional to this (for fixed set of features). You want to make this is as small as you can while still getting to the minimum loss possible.

● What Features to Use

○ This is very important and will be our next main topic

Page 44: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate Way Too Low

● Moving towards optimal model but too slowly

value of weight w

loss

Page 45: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate Too High

value of weight w

loss

Though loss is decreasing, going back and forth overshooting optimal weights

123

456

Page 46: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate Still Too High

Still overshooting as seen in loss oscillating

value of weight w

loss

1 2

3 45

6

Page 47: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Need More Steps (loss still going down)

Page 48: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Good Learning Rate and Number of Steps

Page 49: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Converging to a Poor Model

Though the model appears to have converged, we can see from the calibration plot it is a poor model. Try a different learning rate or better features. Very important to evaluate if model is any good!

Page 50: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Transforming Features

Page 51: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Start By Exploring Your Data

● Gather Statistics○ Average values○ Min and max values○ Standard deviation

● Visualize○ Plot histograms○ Are there missing values?○ Are there outliers?

Page 52: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Engineering

Process of creating features from raw data is Feature Engineering

This process of using domain knowledge and a high-level understanding of the ML model you are using to create good features

Feature Engineering is important and often a very big part of whether a ML system working well in practice.

Page 53: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Why Transform Features

● Linear models can only use a feature values via taking linear combinations of them (y’ = b + w1x1 + … + wnxn)

● Can learn better with normalized numeric features so all are equally important initially

● We must handle missing data, outliers,....

● Convert non-numeric features into numeric values○ can't multiply a weight times a string

Page 54: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Why Transform Features (cont)

● Linear models can represent a lot more problems when we add non-linearities to the features

● We can often introduce new features using domain knowledge. For example, given the width and length of a room, we might add the feature: area = width * length.

● Tokenize or lower-casing of text features● ...

Page 55: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Models tend to converge faster if features on similar scale

Transforming Numeric Features

Page 56: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Transforming Numeric Features (cont)

● As an example:○ What if you are creating a linear model to predict city

mpg of a car from highway mpg (x1) and price (x2)○ What happens if you directly use these features?

Page 57: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Transforming Numeric Features (cont)

● As an example:○ What if you are creating a linear model to predict city

mpg of a car from highway mpg (x1) and price (x2)○ What happens if you directly use these features?○ The price is a much larger number and so dominates

in the linear function w1x1 + w2x2 thus making it hard for the model to learn that highway mpg is much more important.

Page 58: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Common scaling methods:● linear: x' = (x - xmin ) / (xmax- xmin ) to bring to [0,1]● clipping: set min/max values to avoid outliers● log scaling: x = log(x) to handle very skewed distributions● z-score: x' = (x - μ) / σ to center to 0 mean and stddev σ = 1As always, experiment to see which works best in practice

Transforming Numeric Features

Page 59: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Linear Scaling

● Linearly scaling just stretches/shrinks the feature values to fall between 0 and 1 (or sometimes -1 to +1 is used).

● Transformed value x’ = (x - a) / (b - a)● For example, a feature that ranges from a=5000 to b=100000

get scaled between 0 to 1 as:

1000005000

0.0 1.00.5

52500

Page 60: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Clipping● It helps to get rid of extreme values (outliers) by capping all

features above (or below) some value to a fixed value. This can be done before or after other normalizations.

Same feature, capped to max of 4.0

Rooms Per Person Capped Rooms Per Person

Page 61: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Log Scaling● Common data distribution: power law shown on the left

○ Most movies have very few ratings (often called the tail) ○ A few have lots of ratings (often called the head)

● Log scaling improves linear model performance

Page 62: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Z-Score Scaling● This is a variation of linear scaling that normalized that data to

have mean 0 and standard deviation of 1. It’s useful when there are some outliers but not so extreme you need clipping

Page 63: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Add Non-Linearities To a Linear Model

● Linear models can represent a lot more problems when we add non-linearities to the features

● One way to do this is to Bucketize numerical features

Page 64: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Motivation: Bucketizing Numeric Features

● Can try to use a linear model, but can’t capture feature behavior well.

● RMSE 9.065

● Problem is there is a different behavior for the two ranges of compression ratio

Page 65: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Bucketizing Compression Ratio

● Each binary bucketized feature has its own weight.

● RMSE 6.3

● Without also using the raw feature, we only get a single value (the bias) per bucket.

Page 66: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Using Raw and Bucketized Features

● Use raw feature and bucketized feature

● Linear + step function

● Shared slope, independent biases

● RMSE 5.7

Page 67: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Selection of Bucket Boundaries

● We want a way to automate the selection of bucket boundaries

● Quantiles: Each bucket has roughly the same number of points.

● Another option is equal width bins.

Page 68: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Creating Buckets by QuantilesEach bucket is a Boolean variable with each car price falling into a single bucket.

car_price_under_6649car_price_6649_to_7349car_price_7349_to_7999car_price_7999_to_9095car_price_9095_to_10295car_price_10295_to_12440car_price_12440_to_15250car_price_15250_to_17199car_price_17199_to_22470car_price_22470_and_up

Page 69: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Creating Bins by QuantilesBy using quantiles all “expensive” cars get put in a bucket together allowing more resolution for cars in the price range of 90% of the cars in the data set.

Bucket values for car of price $9000:

car_price_under_6649car_price_6649_to_7349car_price_7349_to_7999car_price_7999_to_9095car_price_9095_to_10295car_price_10295_to_12440car_price_12440_to_15250car_price_15250_to_17199car_price_17199_to_22470car_price_22470_and_up

0001000000

Page 70: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Quantiles vs Equal Width Bins

What is the advantage of using quantiles (left) versus equal width bins (right)?

Page 71: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Quantiles vs Equal Width Bins

● Both forms of binning provide non-linear behavior since the feature weight for each bin can be independently set

● Using quantiles gives more resolution in areas where there is more data

● For both techniques, you can adjust the number of bins to vary the amount of resolution versus the number of features introduced.

Page 72: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Representing Categorical Features

How can we represent features such as:● The day of the week● A person’s occupation(s)● The words in an advertisement● The movies a person has rated

Remember a linear model can only take a weighted combination of features.

Page 73: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Graphical View of a Linear Model

Output: y’ = w1x1 + w2x2 + w3x3 + b

Input: x = (x1 , x2 , x3 )w1 w2 w3

x1 x2 x3

b

1

● In the graphical view below inputs are at the bottom and the output is at the top. Each input (or the constant 1 for the bias) is multiplied by the edge weight and they are summed together at the output node.

Page 74: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Transforming Categorical Features

The features can be mapped to numerical values. For example for the days of the week we could use:

0 1 2 3 4 5 6Mon Tues Wed Thur Fri SunSat

If we the value we are predicting increases linearly from Mon to Sun then we could directly use this, but typically you want to learn an independent weight for each value.

Page 75: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Categorical Features: One-Hot Encoding● Introduce a Boolean variable for each feature value● Independent weight is learned for each feature value.● Example: For days of the week, introduce 7 Boolean

variables each with its own learned weight.● Sample one-hot encodings:

Mon Tues Wed Thur Fri SunSatEncoding Tuesday 0 1 0 0 0 0 0

0 0 0 0 1 0 00 0 0 0 0 0 1

7 Boolean Variables for Day of the Week

Encoding FridayEncoding Sunday

Page 76: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Efficiently Representing One-Hot Encoding

Mon Tues Wed Thur Fri SunSat

0 0 0 0 1 0 0

● Map each feature value to a unique index

One-hot encoding for Friday

7 Boolean Variables for Day of the Week

Page 77: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Efficiently Representing One-Hot Encoding

Mon Tues Wed Thur Fri SunSat

0 0 0 0 1 0 0

● Map each feature value to a unique index● Represent a one-hot encoding as sparse vector using

the index of the feature value that is not zero● Encode Friday as <4> using the indices shown in purple.

index

Dense Boolean one-hot encoding for Friday

0 1 3 4 5 62

Page 78: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

One-Hot Encoding vs Single Numeric Feature

● Single numeric feature and weight○ Represent Friday as the integer 4 that

is multiplied by single model weight ● One-hot encoding with 7 Boolean variables

for day of week○ Represent Friday as the sparse vector

<4> meaning the Boolean variable for Friday is 1 and the rest are 0.

● If the difference is not clear, please ask.

1 day of week

4

Mon Tue Wed Thu Fri Sat

1 Sun

0 0 0 0 0 0

learns a weight and bias

Learns one bias per day of week since one input is 1 and the rest are 0

Page 79: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Encoding Features that are Sets/Phrases

Sample ad: “1995 Honda Civic good condition”

Set Boolean variable for each word in ad to 1 and rest to 0:

...1993 1994 1995 1996 1997 … poor good excellent … Honda Toyota ... Civic Accord …. condition

● Sparse representation of ad: <102, 151, 200, 225, 350>

0 0 1 0 0 0 1 0 1 0 1 0 1

100 101 102 103 104 150 151 152 200 201 225 226 350indices ...

dense representation

Page 80: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Encoding Features that are Sets/Phrases

Sample ad: “1995 Honda Civic good condition”

Set Boolean variable for each word in ad to 1 and rest to 0:

...1993 1994 1995 1996 1997 … poor good excellent … Honda Toyota ... Civic Accord …. condition

● Sparse representation of ad: <102, 151, 200, 225, 350>

● We call this a sparse encoding since in the dense representation there are only a few 1s. A one-hot encoding is a special case with a single 1.

0 0 1 0 0 0 1 0 1 0 1 0 1

100 101 102 103 104 150 151 152 200 201 225 226 350indices ...

Page 81: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Vocabulary for Categorical Features

Vocabulary: The mapping from the feature value to its unique index.● Boolean variable is 1 if the feature has

that value and 0 otherwise● Represent as a one-hot or sparse vector

using a set of the non-zero indices

... but sometimes we can't fit all possible feature values into the vocabulary

0 Mon

1 Tues

2 Wed

3 Thur

4 Fri

5 Sat

6 Sun

Vocabulary

Page 82: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Vocabulary - Out of Vocab

Out of Vocabulary (OOV)● Give unique indices to

frequent terms/values● Add one extra index to the

vocabulary, OOV● For items not in the

vocabulary, assign to OOV

0 Super

1 Common

2 Terms

...

N-1 Important

N OOV

Page 83: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

{ man, woman, car, pear, castle, these, are, not, frequent, terms}

Hashing to Define Vocabulary Mapping0 car

1 man

2 not

3 frequentwoman

4 castle

5 are

6 theseterms

7 pear

Hash(string)

Hash Table size of 8

Page 84: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Categorical vs Numerical Features

● How would you encode a feature such as zipcode?● Though it could be used directly as a numerical

feature, you don’t want to multiply it by a weight.● Best to treat zipcode as categorical data.● You could use domain knowledge to group zipcodes

that are geographically nearby into a single bucket.● You need to think about raw features and how to best

use them.

Page 85: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Engineering: Missing FeaturesOften data sets are missing features. If this is extremely rare we could just skip those examples, but otherwise what do we do?● For non-numerical data?

Page 86: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Engineering: Missing FeaturesOften data sets are missing features. If this is extremely rare we could just skip those examples, but otherwise what do we do?● For non-numerical data?

○ A common solution is to just introduce a feature value for “missing”

● How do we handle this for numerical data?

Page 87: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Engineering: Missing FeaturesOften data sets are missing features. If this is extremely rare we could just skip those examples, but otherwise what do we do?● For non-numerical data?

○ A common solution is to just introduce a feature value for “missing”

● How do we handle this for numerical data?○ Use the average value (or some common value)○ Bin the data and introduce a “missing” bin

Page 88: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Add Non-Linearities To a Linear Model

● Linear models can represent a lot more problems when we add non-linearities to the features

● We’ve already seen one way to do this:○ Bucketizing Numerical Features

● Other ways to introduce non-linearities:○ Feature Crosses○ Adding Artificial Variables

Page 89: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Preview: Feature Crosses

● We will study feature crosses in more depth later.● In a linear model the contribution captured for

each feature is independent of the others and this is often not the case in data.

● Feature Crosses introduce non-linear behavior between a set of two or more features by capturing dependencies between the features.

Page 90: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Cross Example

● There is one weight for each feature and for all 16 possible combinations of these feature values.

● Color encodes the value of the weight with red being low and green being high

Cold Cool Warm Hot

None w9 w10 w11 w12

Low w13 w14 w15 w16

Med w17 w18 w19 w20

High w21 w22 w23 w24

Rainfall x Temperature

None Low Med High

w1 w2 w3 w4

RainfallCold Cool Warm Hot

w5 w6 w7 w8

Temperature

Page 91: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Crosses in TensorFlow

● Define new features called crossed column in TF for the cross [A x B] using either bucketized numerical features or categorical features A and B

● The resulting crosses are often extremely sparse

● Crosses can involve any number of features, such as: [A x B x C x D x E]

Page 92: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Crosses: Some Examples

● Housing market price predictor:[latitude X longitude x num_bedrooms]

● Predictor of pet owner satisfaction with pet:[pet behavior type X time of day]

● Tic-Tac-Toe predictor:[pos1 x pos2 x … x pos9]

Page 93: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Visualization of Weights with a Cross

● There is one weight for each feature and for all 16 possible combinations of these feature values.

● Color encodes the value of the weight with red being low and green being high

Cold Cool Warm Hot

None w9 w10 w11 w12

Low w13 w14 w15 w16

Med w17 w18 w19 w20

High w21 w22 w23 w24

Rainfall x Temperature

None Low Med High

w1 w2 w3 w4

RainfallCold Cool Warm Hot

w5 w6 w7 w8

Temperature

Page 94: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Crosses: Why would we do this?

● Linear learners scale well to massive data● But without feature crosses, the expressivity of these

models would be limited● Using feature crosses + massive data is one efficient

strategy for learning highly complex models○ Foreshadowing: Neural nets provide another

● Are there downsides to adding more and more crosses?

Page 95: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Engineering: Be Creative

● We’ve just seen some examples but the key is to think about your data, what a linear model can do and then how to best capture that.

● As one more example, if you wanted to predict the rental price for an apartment and you had a street address, how might you represent that?

● Remember, there’s not one right answer. Sometimes you need to try a few things and see what gives you the best predictions.

Page 96: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Stochastic Gradient Descent: A Closer Look

Page 97: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Review of Overview of Gradient Descent

● Derivative of (y - y’)2 w.r.t. the model weights w tells us how loss changes in the neighborhood of weights we're currently using.

● A gradient is the generalization of a derivative when you have more than one variable.

● We can take small steps in the negative gradient direction to modify the weights so that the loss on that example is lower.

● This optimization strategy is called Gradient Descent.

Page 98: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Training: Decreasing the Loss

loss function

starting point

value of weight w

loss

Page 99: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Training: A Gradient Step

starting point

value of weight w

loss

(negative) gradient

next point

Page 100: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate (Size of Step) Too Small

loss function

value of weight w

loss

Too small of steps learning is too slow

Page 101: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate (Size of Step) Too Large

value of weight w

loss

Overshoots if learning rate is too high

Page 102: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Well-Tuned Learning Rate

value of weight w

loss

Page 103: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Local Minimum

value of weight w

loss

When the loss function we are optimizing is not convex we can get stuck in local minimum

Local minimum

Page 104: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Mini-Batch Gradient Descent

● Computing gradient over the entire dataset for every step is expensive and unnecessary

● Mini-Batch Gradient Descent: Compute gradient on batches typically of 10-1000 samples

Page 105: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Stochastic Gradient Descent (SGD)

● Mini-Batch Gradient Descent using batches that are random samples of examples

● Random samples allow the gradient estimate to be more accurate.

● Many data sets are arranged in some order (e.g. imagine if the automobile data set was sorted by price).

● Easy way to get random samples is to randomly “shuffle” the data and then just take first batch-size examples, then next batch-size examples, ...

Page 106: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Update Step in SGD

● At a high level what happens in each step is

● Remember w and b are instance variables that are part of the internal state of the estimator object that are updated via this formula when the fit method is called.

This is the gradient for the weight being updated as estimated from the labeled examples in the batch

Page 107: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

● To make this concrete let’s consider when y’ = wx + b and we are minimizing (y’ - y)2 . So for w

● is

● So at each step:

Update Step in SGD (cont)

Page 108: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Update Step in SGD (cont)● To make this concrete let’s consider when y’ = wx + b and we are

minimizing (y’ - y)2 . So for w

● is

● So at each step:

● Similarly, at each step:

Page 109: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Update Step in SGD (cont)● To make this concrete let’s consider when y’ = wx + b and we are

minimizing (y’ - y)2 . So for w

● is

● So at each step:

● Similarly, at each step:

● The gradient highlighted in yellow with a red outline is computed by taking the average value over the examples in the batch.

● When y’ < y (prediction too small), y’ - y is negative so w gets bigger bringing the prediction closer y. If y’ > y, w gets smaller.

Page 110: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Generalization

Page 111: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Will Our Model Make Good Predictions?

● Our Key Goal: predict well on new unseen data.● Problem: We only get a sample of data D for training

● We can measure the loss for our model on the sample D but how can we know if it will predict well on new data?

● Performance Measure: A measure of how well the model predicts on unseen data. This is usually related to the loss function.

Page 112: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Example Performance Measure

For regression a common performance metric is root mean squared error (RMSE)

On this dataset, for the best linear model: RMSE = 22.24

Page 113: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Non-Linear Model

What if we use a model that is not a line (called non-linear)

For the model on the right:RMSE = 21.44

This is better than the linear model with RMSE = 22.24

Page 114: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

“Perfect” Model

If we make a very complex model then we can perfectly fit the data, so RMSE = 0

Why might you not want to do this?

Page 115: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Measuring Generalization Ability

Here’s the other half of the data.

We will discuss this more but for now, let’s refer to this “other half” as the test data and the original points as the training data since they were used to train the model.

Page 116: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Generalization for Linear Model

We’ll look at our performance using the best linear model.

Training RMSE = 22.24Test data RMSE = 21.98

Pretty similar.

Page 117: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Generalization for Non-Linear Model

We’ll look at our performance using the non-linear model.

Old RMSE = 21.44New RMSE = 22.74

Still pretty similar.

Page 118: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Generalization for “Perfect” Model

Training data RMSE = 0Test data RMSE = 32This did not generalize to new data!

This is called overfitting.

Page 119: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Overfitting

If we make a very complex model then can perfectly (or near perfectly) fit the training data we just memorize versus the goal of generalizing.

Remember our goal is to build a system to deal with new data!

Page 120: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

An Underfit Model

Underfitting happens when you try to use a model that is too simple.

Page 121: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

How do we know if our model is good?

● William of Occam (back in 14th century) argued that simple explanations of nature are better

● Occam’s Razor principle: the less complex a model is, the more likely to predict new data well

● How can we define how complex a learning model is?● How can we measure how well our model generalizes?

Page 122: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Model Complexity

Our goal is to determine what model complexity is most appropriate.

We’ll discuss ways to do this.

Page 123: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

How do we know if our model is good?

● In practice to determine if our model do well on a new sample of data we use a new sample of data that we call the test set

● Good performance on the test set is a useful indicator of good performance on the new data as long as:

■ The test set is large enough■ The test set is independent of the training set so it

truly represents new data■ We don’t cheat by using the test set over and over

Page 124: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Generalization Curve and Overfitting

A learning curve that is showing overfitting beginning to occur.

Page 125: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Training, Validation, and Test Data Sets

Page 126: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

● Set aside some of the data as test data○ Often just do at random○ Sometimes use most recent data as test data○ You need to be very careful here

Partitioning Data Sets

Training Data Test Data

Page 127: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

A Possible Workflow?

Train model on Training Data

Evaluate model onTest Data

Select features, learning rate, batch size, ... according to results on Test Data

Pick the model that does best on Test Data.Any issues here?

Page 128: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

A Solution to “Polluting” Test Data

Training DataTest Data

Validation Data

● Divide the data we are provided for training our model into two datasets○ Most of it will be in our Training Data○ A portion of it (typically 5-10%) will be used as a

Validation Data.

Page 129: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Better Workflow: Use a Validation Data

Train model on Training Data

Evaluate model on Validation Data

Select features, learning rate, batch size, ... according to results on Validation Data

Pick model that does best on Validation DataCheck for generalization ability on Test Data

Page 130: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

● You need to be very careful when partitioning your data into training, validation and test sets.

● You will explore this some in your next lab.● Let’s look at some real-life examples showing some

subtle but significant errors you can make that would lead you to believe you have a really good model but in practice one that will not generalize to new data.

Partitioning Data Sets

Page 131: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-fold Cross Validation

● Test data must be set aside unless there is no concern about overfitting

● What if we don’t have enough data to set aside enough for validation data?

● For these cases k-fold cross validation is often used● Basic idea is to divide the data into k roughly even size

pieces and in each of k training phases uses 1 piece as validation and the other k-1 as training data

Page 132: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Illustration of k-fold Cross-Validation

Demonstrate with k=5 where each row represents ⅕ of the data.

Training SetValidation Set

Performance measure m1 computed just on the validation set for this fold

Page 133: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Illustration of k-fold Cross-Validation

In second training phases shift which points are used for validation.

Training SetValidation Set

Performance measure m2 computed just on the validation set for this fold

Page 134: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Illustration of k-fold Cross-Validation

Continue shifting which points are used for validation.

Training SetValidation Set

Performance measure m3 computed just on the validation set for this fold

Page 135: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Illustration of k-fold Cross-Validation

Training SetValidation Set

Performance measure m5 computed just on the validation set for this fold

Page 136: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Cross-Validation: Compute Metric

● For each of the k training phases, compute the performance metric (over the validation set for that phase). This gives us m1 , m2 , . . ., mk

● Average m1 , m2 , . . ., mk to get an aggregate performance metric.

● You can also check model stability by comparing the performance across the k runs and also compute standard statistical measures such as standard deviation and error bars over the k folds.

Page 137: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Cross-Validation: Train Final Model

● To train the final model, choose the hyperparameter setting that gives you best aggregated performance over the k runs.

● Now run the algorithm with the chosen hyperparameters using all examples (other than those set aside as test data throughout) as the training data to obtain the final model.

● Use the test data, which has not been used during cross validation to check for any issues with overfitting.

Page 138: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-fold Cross-Validation: Pros and Cons

● The advantage of k-fold cross validation is that we can train on (k-1)/k of the data each phase and that all points are used for validation.

● The disadvantage is that k different models need to be trained for every set of hyperparameters being considered, which is slow.

● Only use k-fold cross-validation if you don’t have enough labeled data to split into independent train, validate and test sets.

Page 139: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Regularization for Simplicity

Page 140: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Ways to Prevent Overfitting

Early stopping: Use learning curve to detect overfitting. Stop training at somewhere around the red line.

Are there better approaches?

Page 141: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Penalizing Model Complexity● We want to avoid model complexity where possible.

○ remember Occam’s razor● We can introduce this idea into the optimization we do

at training time.● Empirical Risk Minimization:

minimize: Loss(Data|Model)

aim for low training error

Page 142: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Penalizing Model Complexity● We want to avoid model complexity where possible.

○ Occam’s razor● We can introduce this idea into the optimization we do

at training time.● Structural Risk Minimization:

minimize: Loss(Data|Model) + complexity(Model)

aim for low training error

...but balance against complexity

Page 143: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Regularization

● How to define complexity(model)?● One possible "prior belief" is that weights

should be normally distributed around 0● Diverging from this should incur a cost

● Can encode this idea via L2 regularization○ complexity(model) = sum of squares of the weights○ a.k.a. square of the L2 norm of the weight vector○ Encourages small weights centered around zero

0

Page 144: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

L2 Regularization

Aim for low training loss

...but balance against

complexity

Lambda controls how

these are balanced

Training Loss

Page 145: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Linear Classifier

Page 146: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Logistic Regression

● Many problems require a probability estimate as output.● Logistic Regression is focused on such problems.● Handy because the probability estimates are calibrated.

○ for example, prob(click) * bid = expected revenue● Also useful for when we need a binary classification

○ click or not click? → prob(click)○ spam or not spam? → prob(spam)

Page 147: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Logistic Regression For Classification

● Suppose we have a linear classifier to predict if email is spam or not-spam.

● The goal is to modify how we define and train a linear model in a way that we can estimate a probability that an email is spam versus just give a classification.

Page 148: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Logistic Regression: Predictions

LogOdds (wTx + b)pass linear model through a sigmoid

(pictured to the right)P

roba

bilit

y O

utpu

tRecall that:

linear model

Page 149: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

LogLoss For Predicting Probabilities

Close relationship to Shannon’s Entropy

measure from Information Theory

Page 150: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Logistic Regression and Regularization

Regularization is super important for logistic regression.● Remember the asymptotes● It’ll keep trying to drive loss to 0 in high dimensions

Two strategies are especially useful:● L2 regularization -- penalizes huge weights.● Early stopping -- limiting training steps or learning rate.

Page 151: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Prediction Bias

● Logistic Regression predictions should be unbiased meaning that: average of predictions ≅ average of observations

● Bias is a sign that something is wrong.○ Zero bias alone does not mean everything is working.○ But it’s a great sanity check.

● If you have bias, you have a problem (e.g. incomplete feature set, biased training sample, …)

● Don’t fix bias with a calibration layer, fix it in the model.● Look for bias in slices of data, this can guide improvements.

Page 152: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Calibration Plots show Bucketed Bias

Each dot represents many examples in the same bucketed prediction range

Page 153: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Evaluation Metrics for Linear Classification

Page 154: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Classification via Thresholding

● Sometimes, we use the output from logistic regression to predict the probability an event occurs.

● Another common use for logistic regression is to use it for binary classification by introducing a threshold.

● Choice of threshold is a very important choice, and can be tuned.

Page 155: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Evaluation Metrics: Accuracy

● How do we evaluate classification models?● One possible measure: Accuracy

○ the fraction of predictions we got right

Page 156: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Group Chat: Accuracy

● Devise a scenario in which accuracy would be a misleading metric of model quality.

Page 157: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Accuracy Can Be Misleading

● In many cases, accuracy is a poor or misleading metric. Two common situations that case this are:○ Class imbalance, when positives or negatives are

extremely rare○ Different kinds of mistakes have different costs

Page 158: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Useful to separate out different kinds of errors. Let’s use spam detection as an example where spam(1), and not spam(0)

Confusion Matrix

ML System Says

Spam Not Spam

Truth

Spam

True Positive

TPFalse Negative

FNNot Spam

False Positive

FPTrue Negative

TN

This is called a confusion matrix.

Page 159: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Confusion Matrix for Multi-Class Problems

● The confusion matrix to the right is for the task of digit prediction

● The strong weights along the diagonal indicate this model is very good.

● Two regions outlined in red illustrate 3 and 5 being confused.

Page 160: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Evaluation Metrics: Precision and Recall

● We return to binary classification (True/False labels)● Precision: (True Positives) / (All Positive Predictions)

○ When model said “positive” class, was it right?○ Intuition: Did the model classify as “spam” too

often?● Recall: (True Positives) / (All Actual Positives)

○ Out of all the possible positives, how many did the model correctly identify?

○ Intuition: Did it classify as “not spam” too often?

Page 161: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Precision vs Recall Trade-offs

● A system with high precision might leave out some good email, but it is very unlikely to let spam through

● A system with high recall might let through some spam, but it also is very unlikely to miss any good email.

● The trick is to balance them -- and which kind of error is more problematic depends a lot on the application. What are some examples of this?

Page 162: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Selecting a Threshold

1.0 (Spam)0.0 (Not Spam)

● True label shown by color ● Prediction from model shown as location on the line

● What happens with a larger decision threshold (B vs A)?○ Prec = TP / (TP + FP)○ Recall = TP / (TP + FN)

A B

TPFPFNTN

Page 163: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Selecting a Threshold

1.0 (Spam)0.0 (Not Spam)

● True label shown by color ● Prediction from model shown as location on the line

● What happens with a larger decision threshold (B vs A)?○ Prec = TP / (TP + FP)○ Recall = TP / (TP + FN)

A

TPFPFNTN FPFP

Page 164: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Selecting a Threshold

1.0 (Spam)0.0 (Not Spam)

● True label shown by color ● Prediction from model shown as location on the line

● What happens with a larger decision threshold (B vs A)?○ Prec = TP / (TP + FP)○ Recall = TP / (TP + FN)

B

TPFPFNTN FNFN

Page 165: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Selecting a Threshold

1.0 (Spam)0.0 (Not Spam)

● True label shown by color ● Prediction from model shown as location on the line

● What happens with a larger decision threshold (B vs A)?○ Prec = TP / (TP + FP) -- Increases since less FP○ Recall = TP / (TP + FN) -- Decreases since more FN

A B

TPFPFNTN

Page 166: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

An ROC Curve

FP Rate

TP R

ate

10

0 1

TP vs. FP rate at another decision threshold

TP vs. FP rate at one decision threshold

Page 167: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample ROC Curve

1.0 (Spam)0.0 (Not Spam)FPR = 1/19, TPR = 3/7FPR = 7/19, TPR = 6/7

Page 168: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample ROC Curve

1.0 (Spam)0.0 (Not Spam)FPR = 2/19, TPR = 5/7

Page 169: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

● AUC: “Area under the ROC Curve”● Interpretation:

○ If we pick a random positive and a random negative, what’s the probability my model scores them in the correct relative order?

● Intuition: gives an aggregate measure of performance aggregated across all possible classification thresholds

Evaluation Metrics: AUC

Page 170: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Group Chat: AUC

What will happen to the AUC if multiply each of the predictions (the y') for a given model by 2.0?

Page 171: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

precision * recall

F-measure● Another commonly way to balance recall and precision is F-measure which

is the harmonic mean of precision and recall (so ranges from 0 to 1).

precision + recallF = 2 *

● Observe that if precision = recall = x then the F-measure is x. Just to give a feel here are a few sample values.

recall

Page 172: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Confusion Matrix for Multiclass Data

This confusion matrix provides a visualization for the task of digit recognition (so possible labels are 0, 1, …, 9). The bar on the right shows the mapping from color to the probability. The high values along the diagonal indicate a strong model.

Page 173: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Regularization for Sparsity

Page 174: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Model Complexity

● Remember Occam’s Razor: Increasing the model complexity increases the chances of overfitting.

● Start simple and build up the complexity only when needed. Your final solution might be sophisticated, but your first attempt shouldn’t be.

Page 175: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Let’s Go Back to Feature Crosses

Caveat: Sparse feature crosses may significantly increase feature space

Possible issues:○ Model size may become huge○ Model requires more data and longer to train○ Overfitting becomes more of a problem

Page 176: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Reasons to Want Sparsity

● Would be nice to encourage weights to exactly 0 where possible○ Saves RAM, may reduce overfitting○ Also those features can be removed from

the model and no longer gathered and stored

Page 177: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

L1 Regularization

● L0 regularization: penalize a weight for being non-zero○ Computationally hard and too expensive to compute

● Relax to L1 regularization: ○ Penalize sum of abs(weights)○ Convex problem so computationally feasible○ Encourages sparsity (unlike L2)

Page 178: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Visualizing L1 and L2 Regularization in 2D

L2 ball lookslike a circle

L1 ball lookslike a diamond

Nearest point on L1 ball is a corner:has a coordinate with value exactly 0.0

Nearest point on L2 ball has two non-zero coordinates

Here’s a point we want to get back to the edge of a ball

Page 179: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Intro to Neural Networks

Page 180: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Other Ways to Learn Nonlinearities

● One alternative to feature crosses:○ Structure the model so that features are used

together as in feature crosses○ Then combine those combinations to get even more

intricate features.

● How can we choose these combinations?○ Let the machine learning model discover them

Page 181: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

A Linear Model

Output: y’ = w1x1 + w2x2 + w3x3 + b

Input: x = (x1 , x2 , x3 )

w1 w2w3

x1 x2 x3

Note: the bias b is part of the model though not pictured

Page 182: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Adding Another Linear Layer

Output

Hidden Layer

Input

Page 183: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Adding Another Linear Layer

Page 184: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Adding Another Linear Layer

w1w2

w3

b

Output

Hidden Layer

Input

Page 185: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Adding Another Linear Layer

w1w2

w3

b

Output

Hidden Layer

Input

Page 186: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

ReLu: A Simple Non-Linear Function

ReLuRectified Linear Unit

( )=max(0, )

Page 187: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Adding a Non-Linearity

Output

Hidden Layer(Linear)

Input

Non-Linear Transformation Layer (a.k.a. Activation Function)

By convention we combine the Activation function into the hidden layer making it a non-linear function. So the network to the left is draw as:

Page 188: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Neural Nets Can Be Arbitrarily Complex

Output

Hidden2

Hidden1

Input

Training done via BackProp algorithm: gradient descent in very non-convex space. We will describe this soon.

Demo 1Demo 2Demo 3

Page 189: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Training Neural Nets

Page 190: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Intro to Backpropagation

●● Backpropagation algorithm visual explanation

Neural Networks are trained with an algorithm called backpropagation that generalizes gradient descent to deeper networks.

Page 191: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Backprop: What You Need To Know

● Gradients are important○ If it’s differentiable, we can probably learn on it

● Gradients can vanish○ Each additional layer can successively reduce signal vs. noise○ ReLu’s are useful here

● Gradients can explode○ Learning rates are important here○ Batch normalization (useful knob) can help

● ReLu layers can die○ Lower your learning rates

Page 192: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Review: Normalizing Feature Values

● We’d like our features to have reasonable scales○ Roughly zero-centered, [-1, 1] range often works well○ Helps gradient descent converge; avoid NaN trap○ Avoiding outlier values can also help

● Can use a few standard methods:○ Linear scaling○ Hard cap (clipping) to max, min○ Log scaling

Page 193: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Dropout Regularization

● Dropout: Another form of regularization, useful for Neural Networks

● Works by randomly “dropping out” unit activations in a network for a single gradient step

● The more you drop out, the stronger the regularization○ 0.0 = no dropout regularization○ 1.0 = drop everything out! learns nothing○ Intermediate values more useful

Page 194: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Multi-Class Neural Nets

Page 195: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

More than two classes?

● Logistic regression gives useful probabilities for binary-class problems.○ spam / not-spam○ click / not-click

● What about multi-class problems?○ animal, vegetable, mineral○ red, orange, yellow, green, blue, indigo, violet○ apple, banana, car, cardiologist, ..., walk sign, zebra, zoo

Page 196: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

● Create a unique output for each possible class● Train that on a signal of “my class” vs “all other classes”● Typically optimize logistic loss

One-Vs-All Multi-Class

apple: yes/no?

bear: yes/no?

candy: yes/no?

dog: yes/no?

egg: yes/no?

one-vs-all(sigmoid)

hidden

hidden logits

Page 197: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

SoftMax Multi-Class● Add an additional constraint that all nodes to sum to 1.0

○ Allows outputs to be interpreted as probabilities● Typically optimize cross-entropy loss (measure of similarity of distributions)

SoftMax Multi-Class

apple: yes/no?

bear: yes/no?

candy: yes/no?

dog: yes/no?

egg: yes/no?

one-vs-all(softmax)

hidden

hidden logits

Page 198: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

SoftMax Intuition

For example, If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].

The prediction for label j is:

Softmax definition where K is the set of possible labels:

Page 199: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

SoftMax Options

● Full SoftMax○ Brute force --Calculates the denominator on all classes. This is

only feasible when there are a relatively small number of classes.

● Candidate Sampling○ Calculates the denominator for all the positive classes, but only

for a random sample of the negatives. This can scale to millions of classes since the number of positive classes for any single example is generally small.

Page 200: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Single-Label vs. Multi-Label

Multi-Class, Single-Label Classification○ An example may be a member of only one class.○ Constraint that classes are mutually exclusive is important if you

want to be able to treat the predictions as probabilities

Multi-Class, Multi-Label Classification:○ An example may be a member of more than one class.○ No additional constraints on class membership to exploit.○ One logistic regression loss for each possible class.

Page 201: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

What to use, when?One-vs-all is less computationally expensive than SoftMax, but at the expense of not having calibration across the labels.

Use One-vs.-All Multi-Class when:○ You want to view the output for each class as a prediction of the

probability that the example belongs to the given class○ You do not need to rank the classes (and thus can reduce the

computational cost by not calibrating across the classes).

Use SoftMax Multi-Class when:○ You want to rank all classes as to which are the best fit

Page 202: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Embeddings

Page 203: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Motivation From Collaborative Filtering

● Input: 1,000,000 users and which of 500,000 movies the user has chosen to watch

● Task: Recommend movies to users

To solve this problem some method is needed to determine which movies are similar to each other.

Page 204: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Organizing Movies by Similarity (1d)

Shrek Incredibles Harry Potter Star Wars The Dark Knight Rises

The Triplets of Belleville

MementoBleu

Page 205: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Organizing Movies by Similarity (2d)

Shrek

Incredibles

Harry Potter

Star WarsThe Dark

Knight Rises

The Triplets of Belleville

Memento

Bleu

Page 206: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Two-Dimensional Embedding

Children

Blockbuster

Arthouse

Adult

Shrek

Incredibles

Harry Potter

Star WarsThe Dark

Knight Rises

Crouching Tiger, Hidden Dragon

School of Rock

The Triplets of Belleville

Wallace and Gromit

Waking LIfeMemento

Bleu

Hero

Page 207: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Two-Dimensional Embedding

Children

Blockbuster

Arthouse

Adult

Shrek

Incredibles

Harry Potter

Star WarsThe Dark

Knight Rises

Crouching Tiger, Hidden Dragon

School of Rock

The Triplets of Belleville

Wallace and Gromit

Waking LIfeMemento

Bleu

(-1.0, 0.95)

(0.65, -0.2)

Hero

Page 208: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

d-Dimensional Embeddings

● Assumes user interest in movies can be roughly explained by d aspects

● Each movie becomes a d-dimensional point where the value in dimension d represents how much the movie fits that aspect

● Embeddings can be learned from data

Page 209: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Embeddings in a Deep Network

● No separate training process needed -- the embedding layer is just a hidden layer with one unit per dimension

● Supervised information (e.g. users watched the same two movies) tailors the learned embeddings for the desired task

● Intuitively the hidden units discover how to organize the items in the d-dimensional space in a way to best optimize the final objective

Page 210: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Input Representation

(0, 1, 0, 1, 0, 0, 0, 1)

● Each example (a row in this matrix) is a sparse vector of features (movies) that have been watched by the user

● Dense representation of this example as:

Is not efficient in terms of space and time

...

Page 211: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Input Representation

...

( , , )

1. Build a dictionary mapping each feature to an integer from 0, …, # movies - 1

2. Efficiently represent the sparse vector as just the movies the user watched:

Represented as: (1, 3, 999999)

210 9999993

Page 212: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

An Embedding Layer in a Deep Network

Words in real estate ad

...

3 Dimensional Embedding

Sparse Vector Encoding

Latitude Longitude

L2 LossRegression problem to predict home sales prices:

Sale Price

Page 213: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

An Embedding Layer in a Deep Network

Raw bitmap of the hand drawn digit

...

3 Dimensional Embedding

Sparse Vector Encoding

Other features

...

Logit Layer

Target Class Label

Softmax Loss

“One-hot” target prob dist. (sparse)

Multiclass Classification to predict a handwritten digit 0 1

0 1 2 3 4 5 6 7 8 9

0 2 3 4 5 6 7 8 9

Page 214: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

An Embedding Layer in a Deep Network

User Movies (subset to use as input features)

...

3 Dimensional Embedding

Sparse Vector Encoding

Other features (optional)

...

Logit Layer

User Movies (subset to use as “labels”)

...

Softmax Loss

Target prob dist. (sparse)

...

Collaborative Filtering to predict movies to recommend:

Page 215: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Deep Network

Correspondence to Geometric View

. . .

● Each of hidden units corresponds to a dimension (latent feature)

● Edge weights between a movie and hidden layer are coordinate values

(0.9, 0.2, 0.4)

Geometric view of a single movie embedding

x

zy

Page 216: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Selecting How Many Embedding Dims

● Higher-dimensional embeddings can more accurately represent the relationships between input values

● But more dimensions increases the chance of overfitting and leads to slower training

● Empirical rule-of-thumb:○ A good starting point but should be tuned using the validation data.

Page 217: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Embeddings as a Tool

● Embeddings map items (e.g. movies, text,...) to low-dimensional real vectors in a way that similar items are close to each other

● Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric

● Jointly embedding diverse data types (e.g. text, images, audio, …) define a similarity between them

Page 218: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Unsupervised Learning: Clustering

Page 219: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

What is Clustering?

Given● a set of items● a similarity metric on these items

Identify groups of most similar items

Page 220: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Some Applications of Clustering

● Grouping similar sets of items● Data compression● Generate features for ML systems● Transfer Information learned from one

setting to another (Transfer Learning)

Page 221: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

How “Hard” is Clustering?

Hard Soft

Page 222: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

How “Hard” is Clustering?

0 0 1 0

1 0 0 0

. . . .

0 1 0 0

0.1 0 0.7 0.2

0.8 0.1 0.1 0

. . . .

0.2 0.6 0.1 0.1

Items

Clusters Clusters

Hard Soft

Page 223: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Is Clustering Supervised?

● No, if similarity is given (rarely the case)

● Otherwise ask “Is similarity learned ?”○ Unsupervised: similarity is designed ad hoc from

features ○ Supervised: similarity is derived from embeddings

via a supervised Deep Neural Network

Page 224: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Understand Your Similarity Measure!● Ad hoc similarity:

○ Handle heterogeneous features with care■ Numerical data of different scales, different

distributions (e.g. Home Price, Lot size, #rooms)A good generic solution: use Quantiles

■ Categorical data of different cardinality(e.g. Zip code, Home type)

■ Missing data: ignore if rare, otherwise infer via ML

Page 225: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Clustering in Euclidean Space

● Designed withthis in mind

● Applied to this

Page 226: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Quality Analysis of Clustering

● Clustering is unsupervised (given similarity): no ground truth to compare to

● Quality Analysis is done by○ Eyeballing○ When clustering is a piece of a larger system,

measure it in the global system

Page 227: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means

● Simple● Efficient● Good basic

clustering algorithm

Page 228: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means● Initialization

○ Select k cluster centersrandomly

Page 229: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means in action● Initialization

○ Select k cluster centersrandomly

● E-Step (Expectation)○ Assign each point to the

closest cluster

Page 230: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means in action● Initialization

○ Select k cluster centersrandomly

● E-Step (Expectation)○ Assign each point to the closest

cluster● M-Step (Maximization)

○ Recompute centers as mean x and y coordinate among current assignment

Page 231: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means in action● Initialization

○ Select k cluster centersrandomly

● E-Step (Expectation)○ Assign each point to the closest

cluster● M-Step (Maximization)

○ Recompute centers as mean x and y coordinate among current assignment

Rep

eat

Page 232: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means in action

● Initialization○ Select k cluster centers

randomly● E-Step (Expectation)

○ Assign each point to the closest cluster

● M-Step (Maximization)○ Recompute centers

Rep

eat

Page 233: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means in action

● Repeat until convergence○ Always converges

when no item is jumping clusters in E-Step

Minimizes within cluster distance

● Depends on initialization

Page 234: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means for Different Initializations

Page 235: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Understand Your Data/Similarity Measure!Why does k-means produce such drastically different results for visually similar data?

Page 236: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

k-Means Summary● Cons

○ Need to determine k○ Result depends on the initialization○ “Curse of dimensionality” -- cost grows exponentially with

the dimensionality of the data (points)● Pros

○ Simple & efficient○ Guaranteed convergence○ Can be warm-started○ Generalization: soft clusters, non-spherical clusters

Page 237: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Course Review

Page 238: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

What is Machine Learning (ML)?

There are many ways to define ML.● ML systems learn how to combine data to

produce useful predictions on never before seen data

● ML algorithms find patterns in data and use these patterns to react correctly to brand new data.

Page 239: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Goals of This Class

● Learn to take a real-life problem and apply machine learning to make predictions.

● Learn to implement Machine Learning solutions using TensorFlow

● Learn how to evaluate the quality of your solution● Machine Learning is a very broad field -- we only

just touch upon some of the most common machine learning algorithms

Page 240: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Linear Regression● y’ = wx + b ● Minimize mean

squared error (or RMSE)

● In general have: y’ = b + w1x1 + … + wnxn

● Learn b, w1, w2, … wn from data

w = 129.25b = 0.382

Page 241: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learn Model Weights with SGD

Use Learning Curve as shown on right to see when the model loss is no longer improving

Page 242: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Negative Correlation

If feature is negatively correlated then weight will be negative.

w = -772.25b = 36806.86

Page 243: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

For training set D composed of examples x = <x0 , x1 , x2 , …, xn> and correct label y, and the prediction for the current model y’

RMSE - Root Mean Squared Error

Our goal is to train a model that minimizes RMSE (which is the same as minimizing the mean squared error).

Page 244: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Using TensorFlow to Do the Computation

import tensorflow as tf

# Define a linear regression model.estimator = tf.contrib.learn.LinearRegressor( optimizer=tf.GradientDescentOptimizer(learning_rate=0.001))

# Fit the model on training data -- minimizes L2 lossestimator.fit(X_train, y_train, steps=10000, batch_size=100)

# Use it to predict on new datapredictions = estimator.predict(X_test)

Page 245: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Linear Regressor With Many Features

● We generally have more than 2 features in a linear regressor. Sample TensorFlow code:

● Calibration plot is a good way to visualize since a scatter plot doesn’t scale well to higher dimensions.

linear_regressor = tf.contrib.learn.LinearRegressor( feature_columns=[age, education_num, age_buckets, capital_gain_buckets, capital_loss_buckets, gender, race, education, occupation, native_country, workclass, education_x_age_buckets, gender_x_education_x_race], optimizer=SGDoptimizer, gradient_clip_norm=5.0 )

Page 246: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Calibration Plot

Good way to visualize if a regression model is good (just having the loss converge does not guarantee this!)

Page 247: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

From Raw Features to Better Features

● How can we best use numeric features that vary a lot in their range, e.g. the purchase price of a car and the age of the car?

● How can we use numeric features like zipcode that are very different a feature like the number of bedrooms in an apartment?

● How can we make use of non-numeric features (e.g. street address, apt type)?

Page 248: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Transformations for Real-Valued Features

Page 249: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Ways to Introduce Non-Linear Behavior

● Divide real-valued features into buckets/bins○ Good strategy is to use

quantiles so there are roughly the same number of examples in each bin

● Take the cross product of any set of bucketized and sparse features.

● Numeric features (e.g. zipcode) can be treated as categorical data and converted to sparse features

Page 250: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Reason to Bucketize (Bin) Features

If you want to predict city-mpg from the compression-ratio a single linear function would not fit well but you can get a pretty good fit by dividing compression ratio into two buckets and then learn a linear model for each bucket.

Page 251: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Creating Bins by QuantilesEach bin is a Boolean variable with each car price falling into a single bin.

car_price_under_6649car_price_6649_to_7349car_price_7349_to_7999car_price_7999_to_9095car_price_9095_to_10295car_price_10295_to_12440car_price_12440_to_15250car_price_15250_to_17199car_price_17199_to_22470car_price_22470_and_up

● So if a car had a price of $9000 then the values of the variable associated with each bin would be as shown to the right in blue.

● Each bin becomes a independent feature with a weight learned for it that only applies when the feature is 1

0001000000

Page 252: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Encoding Categorical Data

Sample ad: “1995 Honda Civic good condition”Representation:..1993 1994 1995 1996 1997 … poor good excellent … Honda Toyota ... Civic Accord …. condition

If I really showed all the words in the vocabulary for any ad very few of the words would occur. This is what we mean by saying it is sparse

0 0 1 0 0 0 1 0 1 0 1 0 1

Page 253: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Feature Crosses For Correlated Features

None Low Med High

RainfallCold Cool Warm Hot

Temperature

Cold Cool Warm Hot

None w1 w2 w3 w4

Low w5 w6 w7 w8

Med w9 w10 w11 w12

High w13 w14 w15 w16

Rainfall x Temperature

Color encodes the value of the weight with red being low and green being high

Page 254: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

TensorFlow - hash bucket size # Sample of creating a categorical column with known values race = tf.contrib.layers.sparse_column_with_keys( column_name="race", keys=["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"]) # Sample of creating a categorical columns with a hash bucket education = tf.contrib.layers.sparse_column_with_hash_bucket( "education", hash_bucket_size=50) # Sample of creating a cross gender_x_education_x_race = tf.contrib.layers.crossed_column( [gender, education, race], hash_bucket_size=1000)

Page 255: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Overfitting

If we make a very complex model then can perfectly (or near perfectly) fit the training data we just memorize versus the goal of generalizing.

Remember our goal is to build a system to deal with new data!

Page 256: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Setting Aside Validation Data

Train model on Training Data

Evaluate model on Validation Data

Select features, learning rate, batch size, ... according to results on Validation Data

Pick model that does best on Validation DataCheck for generalization ability on Test Data

Page 257: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Ensure Validation Data is Representative

This is an example of what happens if you partition data without first randomizing it. Validation data is NOT representative and thus not a good estimate of the classifier/regressor’s performance.

Page 258: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Things You Need to Decide● Learning Rate

○ Very important. Typically change by powers of 10 until the model is training reasonably well and then fine tune

● Number of Steps to Train○ Time to train is proportional to this (for a fixed set of

features) so you want to make this as small as you can but still important that you don’t undertrain.

● Batch Size ○ not that sensitive, that can be the last thing to vary

● What features to use, feature normalization, when to introduce buckets and crosses

Page 259: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate Too High

Page 260: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate Way Too Low

Page 261: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Learning Rate Could Still Be Higher

Page 262: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Good Learning Rate

NOTE: This model is still training and not yet overfitting so increase the number of steps!

Page 263: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Training Curve Showing Overfitting

A model with same data and learning rate trained for 500 (versus 50) iterations. Now we see it overfitting

Page 264: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Things You Need to Decide

● Learning Rate, Steps to Train, Batch Size ● What features to use, feature normalization,

when to introduce buckets and crosses● When the model is more complex you also need

to introduce ways to prevent overfitting.○ Early Stopping, L2-regularization, or dropout

● Ways to Reduce Model Size○ Smaller Buckets, Fewer Features, L1

Regularization

Page 265: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Linear Classifier

loan amount (x1)

Inco

me

(x2)

Convert Real-Valued to Probability Using:

LogOdds (wTx + b)

Pro

babi

lity

Out

put

1

0

Page 266: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

LinearClassifier vs LinearRegressor linear_regressor = tf.contrib.learn.LinearRegressor( feature_columns=[age, education_num, age_buckets, capital_gain_buckets, capital_loss_buckets, gender, race, education, occupation, native_country, workclass, education_x_age_buckets, gender_x_education_x_race], optimizer=SGDoptimizer, gradient_clip_norm=5.0 )

linear_classifier = tf.contrib.learn.LinearClassifier( feature_columns=[age, education_num, age_buckets, capital_gain_buckets, capital_loss_buckets, gender, race, education, occupation, native_country, workclass, education_x_age_buckets, gender_x_education_x_race], optimizer=SGDoptimizer, gradient_clip_norm=5.0 )

● Use Regressor to predict a real-valued feature (minimize RMSE)

● Use Classifier to predict a True (1), False (0) feature (minimize log loss)

Page 267: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

LinearClassifier vs LinearRegressor ● Choices in feature engineering the same● Process of selecting and using validation (and test data) is the same● Tuning learning rate, number of steps, batch size, regularization are

the same

● Evaluation metrics change● Instead of RMSE interested in things like accuracy, ROC curve

(trade-off in false positive vs false negative rate), AUC (area under ROC)

● AUC gives probability a random + example is predicted with a higher probability than a random - example. So 0.5 random guess and 1.0 is a perfect model.

Page 268: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Sample ROC Curve

1.0 (Spam)0.0 (Not Spam)FPR = 1/19, TPR = 3/7FPR = 7/19, TPR = 6/7

Page 269: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

ROC Curves for Models from Lab 3

Model size original: 533Model size no reg: 429Model size l2: 429Model size l1, l2: 119Model size l1 strong, l2: 70

Page 270: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

LinearClassifier With > 2 Classes

linear_classifier = tf.contrib.learn.LinearClassifier( feature_columns=feature_columns, n_classes=10, optimizer=SGDoptimizer, gradient_clip_norm=5.0 )

● Example using LinearClassifier to learn 10 classes (digits 0, …, 9)

● Here the labels must be 0, …, 9 (or a sparse feature with 10 values).● Now optimize softmax loss which is a generalization of log loss when you

have a probability distribution of more than just two values● Again we need to modify the visualizations a bit looking at a confusion

matrix versus an ROC curve.

Page 271: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Confusion Matrix

Page 272: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

DNN: Add a Non-Linearity

Output

Hidden Layer(Linear)

Input

Non-Linear Transformation Layer (a.k.a. Activation Function)

By convention we combine the Activation function into the hidden layer making it a non-linear function. So the network to the left is draw as:

Page 273: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Non-linearity in DNN let’s it learn to do this

If you want to predict city-mpg from the compression-ratio a single linear function would not fit well but you can get a pretty good fit by dividing compression ratio into two buckets and then learn a linear model for each bucket.

Page 274: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Deep Neural Networks -- Add Layers

Output

Hidden2

Hidden1

Input

● Training done via BackProp algorithm which is an extension of SGD

● The hidden layers closer to the output capture higher level features (since they learn over the features from the previous layer)

● For this network in TensorFlow you’d have :○ hidden_units=[4, 3]

Page 275: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

DNN Classifier or Regressor in TFDNNclassifier = tf.contrib.learn.DNNClassifier( feature_columns=feature_columns, n_classes=10, hidden_units=[50, 25, 10], optimizer=optimizer, gradient_clip_norm=5.0, )

DNNregressor = tf.contrib.learn.DNNRegressor( feature_columns=feature_columns, hidden_units=[50, 25, 10], optimizer=optimizer, gradient_clip_norm=5.0, )

Page 276: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

DNN Reduces Feature Engineering● It can learn to bucketize real-value features

● It can learn crosses

● As you add model weights it takes more data and time to train and overfitting becomes more of an issue

● Use L2 regularization or drop out to control overfitting

● Along with other hyperparameters (e.g. learning rate, num steps) you need to pick the DNN configuration (how many levels of hidden units and how many units at each level).

Page 277: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

Embeddings as a Tool

● Embeddings map items (e.g. movies, text,...) to low-dimensional real vectors in a way that similar items are close to each other

● Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric

● Jointly embedding diverse data types (e.g. text, images, audio, …) define a similarity between them

Page 278: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

An Embedding Layer in a DNNRegresor

Words in real estate ad

...

3 Dimensional Embedding

Input to Embedding Layer is Sparse Vector Encoding

Latitude Longitude

DNNRegressorRegression problem to predict home sales prices:

Sale Price

Page 279: Machine Learning Introduction To - Google Search€¦ · Introduction To Machine Learning Portions of this course are from Machine Learning Crash Course. ... Learn to implement machine

An Embedding Layer in a DNNClassifier

Raw bitmap of the hand drawn digit

...

3 Dimensional Embedding

Input to Embedding Layer is Sparse Vector Encoding

Other features

...

Predicted probability for the 10 classes

Target Class Label

DNNClassifier

“One-hot” target prob dist. (sparse)

Multiclass Classification to predict a handwritten digit 0 1

0 1 2 3 4 5 6 7 8 9

0 2 3 4 5 6 7 8 9