
Introduction to Machine Learning

Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth

Lecture 15: Online Learning: Stochastic Gradient Descent

Perceptron Algorithm
Kernel Methods

Many figures courtesy Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective

Batch versus Online Learning

• Many learning algorithms we've seen can be phrased as batch minimization of the following objective:

  f(θ) = Σ_{i=1}^N f(θ, z_i),   where z_i is one of N data points (for ML; similar for MAP)

• This produces effective prediction algorithms, but can require significant computation and storage for training

• We can do online learning from streaming data via stochastic gradient descent:

  θ_{k+1} = θ_k − η_k ∇_θ f(θ_k, z_k)

• SGD takes a small step based on single observations "sampled" from the data's empirical distribution:

  p(z) = (1/N) Σ_{i=1}^N δ_{z_i}(z)

Stochastic Gradient Descent

• How can we produce a single parameter estimate?
  Polyak-Ruppert averaging of the iterates θ_k.

• How should we set the step size η_k?
  Robbins-Monro conditions (Σ_k η_k = ∞, Σ_k η_k^2 < ∞); conventional batch step size rules fail (in theory & practice).

• How does this work in practice?
  Excellent for big datasets, but tuning parameters is tricky.

• Refinement: take batches of data for each step, 1 < B ≪ N (a code sketch of this loop follows below).
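To make the loop above concrete, here is a minimal generic SGD sketch (not from the lecture): it assumes a user-supplied per-example gradient grad(theta, z_i), uses an illustrative Robbins-Monro schedule η_k = η_0 (k+1)^(−κ) with 0.5 < κ ≤ 1, supports mini-batches of size B, and keeps a Polyak-Ruppert running average of the iterates.

```python
import numpy as np

def sgd(grad, theta0, data, n_epochs=10, batch_size=1, eta0=0.1, kappa=0.6, seed=0):
    """Generic SGD sketch: mini-batches of size B, Robbins-Monro step sizes
    eta_k = eta0 * (k+1)^(-kappa) (kappa in (0.5, 1] gives sum eta_k = inf,
    sum eta_k^2 < inf), and Polyak-Ruppert averaging of the iterates."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()                      # running average of iterates
    N, k = len(data), 0
    for _ in range(n_epochs):
        perm = rng.permutation(N)                 # "sample" from the empirical distribution
        for start in range(0, N - batch_size + 1, batch_size):
            idx = perm[start:start + batch_size]
            g = np.mean([grad(theta, data[i]) for i in idx], axis=0)
            theta = theta - eta0 * (k + 1) ** (-kappa) * g   # stochastic gradient step
            k += 1
            theta_bar += (theta - theta_bar) / k  # Polyak-Ruppert running average
    return theta, theta_bar

# Example gradient for least squares, with data points z = (x, y):
#   grad = lambda theta, z: (theta @ z[0] - z[1]) * z[0]
```

Returning both the final and the averaged iterate makes it easy to compare the two estimates; the averaged one is usually the less noisy of the two.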

Least Mean Squares (LMS)

[Figures: RSS vs. iteration, and the black line = LMS trajectory towards the LS solution (red cross) in parameter space (w0 axis shown).]

Stochastic gradient descent applied to the linear regression model:

  f(θ, y_i, x_i) = (1/2) (y_i − θ^T φ(x_i))^2

  ŷ_k = θ_k^T φ(x_k)
  θ_{k+1} = θ_k + η_k (y_k − ŷ_k) φ(x_k)
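A minimal sketch of the LMS update above, assuming φ(x) simply appends a constant 1 to x for the bias term, and using a decaying step size as one simple choice; the synthetic data in the comment is illustrative.

```python
import numpy as np

def lms(X, y, eta=0.1, n_epochs=100, seed=0):
    """Least Mean Squares: SGD on f(theta, y_i, x_i) = 0.5 * (y_i - theta^T phi(x_i))^2,
    with the update theta <- theta + eta_k * (y_k - yhat_k) * phi(x_k)."""
    Phi = np.hstack([np.ones((len(X), 1)), np.asarray(X, float)])   # phi(x) = [1, x]
    y = np.asarray(y, float)
    rng = np.random.default_rng(seed)
    theta = np.zeros(Phi.shape[1])
    k = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            yhat = theta @ Phi[i]                        # prediction yhat_k = theta_k^T phi(x_k)
            theta = theta + (eta / np.sqrt(k + 1)) * (y[i] - yhat) * Phi[i]
            k += 1
    return theta

# Tiny illustrative example on noiseless 1-D data:
#   X = np.linspace(-1, 3, 21).reshape(-1, 1); y = 1.0 + 2.0 * X[:, 0]
#   lms(X, y)   # approaches the least squares solution [1, 2]
```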

SGD for Logistic Regression

[Figure: the logistic sigmoid function σ(z), plotted for z from −10 to 10.]

  σ(z) = sigm(z) = 1 / (1 + e^{−z})

  p(y_i | x_i, θ) = Ber(y_i | μ_i),   μ_i = σ(θ^T φ(x_i))

  f(θ) = −Σ_{i=1}^N [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ]

• Batch gradient function:

  ∇f(θ) = Σ_{i=1}^N (μ_i − y_i) φ(x_i)

• Stochastic gradient descent:

  θ_{k+1} = θ_k + η_k (y_k − μ_k) φ(x_k),   μ_k = σ(θ_k^T φ(x_k)),   0 < μ_k < 1
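A corresponding sketch for logistic regression, implementing the stochastic update θ_{k+1} = θ_k + η_k (y_k − μ_k) φ(x_k); the step-size schedule and function names are illustrative choices, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                      # sigma(z) = 1 / (1 + e^{-z})

def logistic_sgd(Phi, y, eta=0.1, n_epochs=100, seed=0):
    """SGD for logistic regression with labels y_i in {0, 1}:
    mu_k = sigma(theta_k^T phi(x_k)),  theta <- theta + eta_k * (y_k - mu_k) * phi(x_k)."""
    Phi = np.asarray(Phi, float)
    y = np.asarray(y, float)
    rng = np.random.default_rng(seed)
    theta = np.zeros(Phi.shape[1])
    k = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            mu = sigmoid(theta @ Phi[i])                 # 0 < mu_k < 1
            theta = theta + (eta / np.sqrt(k + 1)) * (y[i] - mu) * Phi[i]
            k += 1
    return theta
```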

Perceptron MARK 1 Computer

Frank Rosenblatt, late 1950s

Decision Rule: ŷ_i = I(θ^T φ(x_i) > 0)

Learning Rule: If ŷ_k = y_k, θ_{k+1} = θ_k. If ŷ_k ≠ y_k, θ_{k+1} = θ_k + ỹ_k φ(x_k), where ỹ_k = 2y_k − 1 ∈ {+1, −1}.
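A minimal sketch of Rosenblatt's learning rule, assuming labels y_i ∈ {0, 1} and an epoch cap (needed because, as noted on a later slide, the algorithm never converges on non-separable data).

```python
import numpy as np

def perceptron(Phi, y, n_epochs=100):
    """Perceptron: predict yhat_k = I(theta^T phi(x_k) > 0); on a mistake add
    ytilde_k * phi(x_k), where ytilde_k = 2*y_k - 1 in {+1, -1}."""
    Phi = np.asarray(Phi, float)
    y = np.asarray(y, int)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for phi_k, y_k in zip(Phi, y):
            yhat = int(theta @ phi_k > 0)          # decision rule
            if yhat != y_k:                        # learning rule: update only on mistakes
                theta += (2 * y_k - 1) * phi_k
                mistakes += 1
        if mistakes == 0:                          # all points correct: a separator was found
            break
    return theta
```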

Perceptron Algorithm Convergence

[Figure: four snapshots of the perceptron decision boundary during training on 2-D data, axes from −1 to 1.]

C. Bishop, Pattern Recognition & Machine Learning


Perceptron Algorithm Properties

Strengths:
• Guaranteed to converge if data are linearly separable (in feature space; each update reduces the angle to the true separators)
• Easy to construct a kernel representation of the algorithm

Weaknesses:
• May be slow to converge (worst-case performance poor)
• If data are not linearly separable, it will never converge
• Solution depends on the order in which data are visited; no notion of a best separating hyperplane
• Non-probabilistic: no measure of confidence in decisions, difficult to generalize to other problems

Covariance Matrices

• Eigenvalues and eigenvectors:  Σ u_i = λ_i u_i,  i = 1, ..., d

• For a symmetric matrix: λ_i ∈ R, with orthonormal eigenvectors u_i^T u_i = 1, u_i^T u_j = 0 (i ≠ j), so that

  Σ = U Λ U^T = Σ_{i=1}^d λ_i u_i u_i^T

• For a positive semidefinite matrix: λ_i ≥ 0

• For a positive definite matrix: λ_i > 0, and the inverse exists:

  Σ^{-1} = U Λ^{-1} U^T = Σ_{i=1}^d (1/λ_i) u_i u_i^T

• Rotated coordinates: y_i = u_i^T (x − μ)
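As a small numerical illustration of these identities (the particular 2×2 matrix and vectors are made up for the example):

```python
import numpy as np

# Eigendecomposition of a symmetric covariance matrix Sigma = U Lambda U^T.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
lam, U = np.linalg.eigh(Sigma)              # symmetric: real eigenvalues, orthonormal columns of U
assert np.all(lam > 0)                      # positive definite: all eigenvalues > 0
Sigma_rebuilt = U @ np.diag(lam) @ U.T      # sum_i lambda_i u_i u_i^T
Sigma_inv = U @ np.diag(1.0 / lam) @ U.T    # Sigma^{-1} = U Lambda^{-1} U^T
x, mu = np.array([1.0, 2.0]), np.array([0.5, 0.5])
y = U.T @ (x - mu)                          # rotated coordinates y_i = u_i^T (x - mu)
```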

Mercer Kernel Functions

X: arbitrary input space (vectors, functions, strings, graphs, ...)

• A kernel function maps pairs of inputs to real numbers:

  k : X × X → R,   k(x_i, x_j) = k(x_j, x_i)

  Intuition: larger values indicate the inputs are "more similar"

• A kernel function is positive semidefinite if and only if, for any n ≥ 1 and any x = {x_1, x_2, ..., x_n}, the Gram matrix K ∈ R^{n×n} with K_ij = k(x_i, x_j) is positive semidefinite

• Mercer's Theorem: Assuming certain technical conditions, every positive definite kernel function can be represented as

  k(x_i, x_j) = Σ_{ℓ=1}^d φ_ℓ(x_i) φ_ℓ(x_j)

  for some feature mapping φ (but may need d → ∞)

Exponential Kernels

X: real vectors of some fixed dimension

  k(x_i, x_j) = exp{ −( ‖x_i − x_j‖ / σ )^γ },   0 < γ ≤ 2

We can construct a covariance matrix by evaluating the kernel at any set of inputs, and then sample from the zero-mean Gaussian distribution with that covariance. This is a Gaussian process.
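A sketch of the Gaussian-process construction just described, using the γ = 2 (squared exponential) case on scalar inputs for simplicity; the grid, length scale, and jitter value are illustrative choices.

```python
import numpy as np

def exp_kernel(xi, xj, sigma=1.0, gamma=2.0):
    """k(x_i, x_j) = exp(-(|x_i - x_j| / sigma)^gamma), valid for 0 < gamma <= 2."""
    return np.exp(-(np.abs(xi - xj) / sigma) ** gamma)

# Build the covariance matrix by evaluating the kernel at a set of inputs,
# then sample from the zero-mean Gaussian with that covariance (a Gaussian process).
x = np.linspace(0.0, 10.0, 200)
K = exp_kernel(x[:, None], x[None, :], sigma=1.0, gamma=2.0)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
# The small diagonal jitter keeps the covariance numerically positive definite.
```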

Polynomial Kernels

X: real vectors of some fixed dimension

• The polynomial kernel has an explicit feature mapping, but the number of features grows rapidly (combinatorially) with the input dimension and the polynomial degree

• The squared exponential kernel requires an infinite number of features (roughly, radial basis functions at all possible locations in the input space)
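To see the explicit-feature-map claim concretely, here is one common degree-2 polynomial kernel, k(x, z) = (1 + xᵀz)², checked against its explicit 6-dimensional feature map for inputs in R²; the specific kernel form is an assumption, chosen for illustration.

```python
import numpy as np

def poly2_kernel(x, z):
    """One common degree-2 polynomial kernel: k(x, z) = (1 + x^T z)^2."""
    return (1.0 + x @ z) ** 2

def poly2_features(x):
    """Explicit feature map for this kernel when x is in R^2:
    phi(x) = [1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2].
    The feature count grows rapidly with dimension and degree; the kernel does not."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, z = np.array([0.5, -1.0]), np.array([2.0, 0.3])
assert np.isclose(poly2_kernel(x, z), poly2_features(x) @ poly2_features(z))
```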

String Kernels

X: strings of characters from some finite alphabet of size A

• Feature vector: count of the number of times that every substring, of every possible length, occurs within the string; the number of possible features is

  D = A + A^2 + A^3 + A^4 + ···

• Using suffix trees, the kernel can be evaluated in time linear in the length of the input strings

[Figure: two amino acid sequences x and x′ compared via their shared substrings.]
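A naive sketch of the substring-count kernel (enumerating all substrings directly rather than using suffix trees, so it is quadratic rather than linear in the string lengths); function names are illustrative.

```python
from collections import Counter

def substring_counts(s):
    """Feature vector (as a sparse dict): counts of every substring of every length."""
    return Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

def string_kernel(x, xp):
    """k(x, x') = inner product of the substring-count feature vectors of x and x'."""
    cx, cxp = substring_counts(x), substring_counts(xp)
    return sum(cx[s] * cxp[s] for s in cx.keys() & cxp.keys())

# Example: string_kernel("GATTACA", "TACAT") counts all shared substrings.
```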

Kernels and Features

• What features lead to valid, positive semidefinite kernels?

  k(x_i, x_j) = φ(x_i)^T φ(x_j) is valid for any φ : X → R^d

• When is a hypothesized kernel function positive semidefinite?

  It can be tricky to verify whether an underlying feature mapping exists.

• How can I build new kernel functions? Given valid kernels k, k_1, k_2, the following are also valid kernels:

  c k(x_i, x_j), c > 0
  f(x_i) k(x_i, x_j) f(x_j) for any f(x)
  k_1(x_i, x_j) + k_2(x_i, x_j)
  k_1(x_i, x_j) k_2(x_i, x_j)
  exp(k(x_i, x_j))
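One practical, if partial, check of positive semidefiniteness is to build the Gram matrix on a sample of inputs and inspect its eigenvalues; a passing sample is evidence, not proof, that the kernel is valid. A sketch:

```python
import numpy as np

def is_psd_gram(kernel, xs, tol=1e-10):
    """Build the Gram matrix K_ij = k(x_i, x_j) on the sample xs and check that
    all eigenvalues are >= -tol (numerical positive semidefiniteness)."""
    K = np.array([[kernel(xi, xj) for xj in xs] for xi in xs])
    return np.all(np.linalg.eigvalsh(K) >= -tol)

# Example: exp of the linear kernel should pass, by the composition rules above.
#   k = lambda a, b: np.exp(np.dot(a, b))
#   xs = [np.random.randn(3) for _ in range(20)]
#   is_psd_gram(k, xs)
```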

Kernelizing Learning Algorithms

• Start with any learning algorithm based on features φ(x). (Don't worry that computing the features might be expensive or impossible.)
• Manipulate the steps in the algorithm so that it depends not directly on the features, but only on their inner products: k(x_i, x_j) = φ(x_i)^T φ(x_j)
• Write code that only uses calls to the kernel function
• Basic identity: squared distance between feature vectors:

  ‖φ(x_i) − φ(x_j)‖_2^2 = k(x_i, x_i) + k(x_j, x_j) − 2 k(x_i, x_j)

Examples:
• Feature-based nearest neighbor classification
• Feature-based clustering algorithms (later)
• Feature-based nearest centroid classification (see the sketch after this list):

  ŷ_test = argmin_c ‖φ(x_test) − μ_c‖^2,   μ_c = (1/N_c) Σ_{i | y_i = c} φ(x_i)   (mean of the N_c training examples of class c)
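A sketch of kernelized nearest centroid classification, expanding ‖φ(x_test) − μ_c‖² with the squared-distance identity so that only kernel evaluations are needed; function and variable names are illustrative.

```python
import numpy as np

def kernel_nearest_centroid(kernel, X_train, y_train, x_test):
    """Pick the class c minimizing ||phi(x_test) - mu_c||^2
    = k(x,x) - (2/N_c) sum_i k(x, x_i) + (1/N_c^2) sum_{i,j} k(x_i, x_j),
    where the sums run over training examples of class c; phi is never computed."""
    best_c, best_dist = None, np.inf
    for c in set(y_train):
        Xc = [x for x, y in zip(X_train, y_train) if y == c]
        Nc = len(Xc)
        cross = sum(kernel(x_test, xi) for xi in Xc)
        within = sum(kernel(xi, xj) for xi in Xc for xj in Xc)
        dist = kernel(x_test, x_test) - 2.0 * cross / Nc + within / Nc ** 2
        if dist < best_dist:
            best_c, best_dist = c, dist
    return best_c
```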

Kernelized Perceptron Algorithm

Original algorithm (representation: D feature weights θ):

Decision Rule: ŷ_test = I(θ^T φ(x_test) > 0)

Learning Rule: If ŷ_k = y_k, θ_{k+1} = θ_k. If ŷ_k ≠ y_k, θ_{k+1} = θ_k + ỹ_k φ(x_k), where ỹ_k = 2y_k − 1 ∈ {+1, −1}.

Problem: It may be intractable to compute or store φ(x_k), θ_k.

Initialize with θ_0 = 0. By induction, for all k,

  θ_k = Σ_{i=1}^N s_{ik} φ(x_i)   for some integers s_{ik}

Kernelized algorithm (representation: N training example weights s_i), sketched in code below:

Decision Rule: ŷ_test = I( Σ_{i=1}^N s_i k(x_test, x_i) > 0 )

Learning Rule: If ŷ_k = y_k, s_{k,k+1} = s_{k,k}. If ŷ_k ≠ y_k, s_{k,k+1} = s_{k,k} + ỹ_k.
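A minimal sketch of the kernelized perceptron above, assuming labels y_i ∈ {0, 1}, storing one integer weight per training example, and precomputing the Gram matrix for simplicity.

```python
import numpy as np

def kernel_perceptron(kernel, X, y, n_epochs=50):
    """Kernelized perceptron: theta is represented implicitly via N weights s_i,
    so theta = sum_i s_i phi(x_i) and theta^T phi(x) = sum_i s_i k(x, x_i).
    On a mistake at example k, s_k += ytilde_k with ytilde_k = 2*y_k - 1."""
    N = len(X)
    s = np.zeros(N)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # precomputed Gram matrix
    for _ in range(n_epochs):
        mistakes = 0
        for k in range(N):
            yhat = int(s @ K[:, k] > 0)         # decision rule via kernel evaluations only
            if yhat != y[k]:
                s[k] += 2 * y[k] - 1            # learning rule: update weight of example k
                mistakes += 1
        if mistakes == 0:
            break
    return s

def kernel_perceptron_predict(kernel, X, s, x_test):
    """Decision rule for a new input: I(sum_i s_i k(x_test, x_i) > 0)."""
    return int(sum(si * kernel(x_test, xi) for si, xi in zip(s, X)) > 0)
```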
