CSE446: PAC-learning, VC Dimension, Winter 2015
Luke Zettlemoyer
Slides adapted from Carlos Guestrin
What now…
• We have explored many ways of learning from data
• But…
 – How good is our classifier, really?
 – How much data do I need to make it "good enough"?
A simple setting…
• Classification
 – m data points
 – Finite number of possible hypotheses (e.g., decision trees of depth d)
• A learner finds a hypothesis h that is consistent with the training data
 – Gets zero error in training: errortrain(h) = 0
• What is the probability that h has more than ε true error?
 – errortrue(h) ≥ ε
How likely is a bad hypothesis to get m data points right?
• Hypothesis h that is consistent with the training data
 – got m i.i.d. points right
 – h is "bad" if it gets all this data right, but has high true error
 – What is the probability of this happening?
• Prob. that h with errortrue(h) ≥ ε gets a randomly drawn data point right
• Prob. that h with errortrue(h) ≥ ε gets m i.i.d. data points right
P(errortrue(h) ≥ ε, gets one data point right) ≤ 1 − ε
P(errortrue(h) ≥ ε, gets m i.i.d. data points right) ≤ (1 − ε)^m
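For concreteness, a quick worked example with illustrative numbers (my choice, not from the slides): with ε = 0.1 and m = 100,

```latex
(1-\epsilon)^m = 0.9^{100} \approx 2.7 \times 10^{-5},
```

so a single fixed "bad" hypothesis is very unlikely to look perfect on the training data.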
But there are many possible hypotheses that are consistent with the training data
• Hc ⊆ H: the subset of hypotheses consistent with the data
• Which classifier should we learn?
 – and how do we generalize the bounds?
• We want to make as few assumptions as possible!
• So, pick any h∈Hc
• But wait, we had a bound on a single h, now we need to bound the worst h∈Hc
Union bound • P(A or B or C or D or …)
≤ P(A) + P(B) + P(C) + P(D) + …
Q: Is this a tight bound? Will it be useful?
How likely is the learner to pick a bad hypothesis?
• There are k hypotheses consistent with the data
 – How likely is the learner to pick a bad one?
 – We need a bound that holds for all of them!
P(errortrue(h) ≥ ε, gets m i.i.d. data points right) ≤ (1 − ε)^m
P(errortrue(h1) ≥ ε OR errortrue(h2) ≥ ε OR … OR errortrue(hk) ≥ ε)
 ≤ ∑j P(errortrue(hj) ≥ ε)   ← union bound
 ≤ ∑j (1 − ε)^m              ← bound on individual hj's
 ≤ |H| (1 − ε)^m             ← k ≤ |H|
 ≤ |H| e^(−mε)               ← (1 − ε) ≤ e^(−ε) for 0 ≤ ε ≤ 1
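A minimal numerical check of this chain of inequalities (the values of |H|, ε, and m are illustrative assumptions, not from the slides):

```python
import math

# Illustrative values (assumptions, not from the slides).
H_size = 10**6   # |H|: number of hypotheses
eps = 0.1        # threshold on true error
m = 200          # number of i.i.d. training points

union_bound = H_size * (1 - eps) ** m        # |H| (1 - eps)^m
relaxed_bound = H_size * math.exp(-m * eps)  # |H| e^(-m eps), using (1 - eps) <= e^(-eps)

print(f"|H|(1-eps)^m   = {union_bound:.3e}")
print(f"|H| e^(-m*eps) = {relaxed_bound:.3e}")
assert union_bound <= relaxed_bound  # the exponential form only loosens the bound
```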
Generalization error in finite hypothesis spaces [Haussler '88]
• Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent with the training data:
P(errortrue(h) ≥ ε) ≤ |H| e^(−mε)
Using a PAC bound
• Typically, two use cases:
 – Case 1: Pick ε and δ, compute m
 – Case 2: Pick m and δ, compute ε
Argument: For all consistent h we know that P(errortrue(h) ≥ ε) ≤ |H| e^(−mε).
So, if we require |H| e^(−mε) ≤ δ, then with probability at least 1 − δ the learned h has errortrue(h) < ε. Solving for the unknown quantity:
ln(|H| e^(−mε)) ≤ ln δ
ln |H| − mε ≤ ln δ
• Case 1 (pick ε and δ, solve for m):  m ≥ (ln |H| + ln(1/δ)) / ε
• Case 2 (pick m and δ, solve for ε):  ε ≥ (ln |H| + ln(1/δ)) / m
• Log dependence on |H|: OK if the hypothesis space is exponentially large (but not doubly exponentially large)
• ε shrinks at rate O(1/m); ε has a stronger influence than δ
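A small calculator for the two use cases (a sketch; the function names and the example |H| are my own, not from the slides):

```python
import math

def sample_size(H_size, eps, delta):
    """Case 1: smallest m such that |H| e^(-m*eps) <= delta."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def error_bound(H_size, m, delta):
    """Case 2: smallest eps guaranteed with probability >= 1 - delta."""
    return (math.log(H_size) + math.log(1 / delta)) / m

# Example with an illustrative finite hypothesis space of size 2^30.
H_size = 2 ** 30
print(sample_size(H_size, eps=0.1, delta=0.05))   # m needed for eps = 0.1
print(error_bound(H_size, m=10_000, delta=0.05))  # eps achievable with 10,000 points
```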
Limitations of the Haussler '88 bound
• Do we really want to pick a consistent hypothesis h (one with errortrain(h) = 0)?
• Size of the hypothesis space
 – What if |H| is really big?
 – What if it is continuous?
• First goal: Can we get a bound for a learner with non-zero errortrain(h) on the training set?
Question: What's the expected error of a hypothesis?
• Estimating the error of a hypothesis is like estimating the parameter of a coin!
• Chernoff bound: for m i.i.d. coin flips x1, …, xm, where xi ∈ {0,1} with P(xi = 1) = θ, and for 0 < ε < 1:
P(θ − (1/m) ∑i xi ≥ ε) ≤ e^(−2mε²)
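A quick Monte Carlo sanity check of this bound (the coin bias θ, m, and ε below are illustrative choices):

```python
import random, math

theta, m, eps = 0.6, 100, 0.1   # true bias, flips per trial, deviation (illustrative)
trials = 20_000

# Fraction of trials in which the empirical mean underestimates theta by at least eps.
bad = sum(
    theta - sum(random.random() < theta for _ in range(m)) / m >= eps
    for _ in range(trials)
)
print("empirical frequency:", bad / trials)
print("Chernoff bound e^(-2*m*eps^2):", math.exp(-2 * m * eps ** 2))
```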
Generalization bound for |H| hypotheses
• Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h:
P(errortrue(h) − errortrain(h) ≥ ε) ≤ |H| e^(−2mε²)
Why? Same reasoning as before. Use the Union bound over individual Chernoff bounds
PAC bound and Bias-Variance tradeoff
• Important: the PAC bound holds for all h, but doesn't guarantee that the algorithm finds the best h!!!
• Or, after moving some terms around, with probability at least 1 − δ:
errortrue(h) ≤ errortrain(h) + √((ln |H| + ln(1/δ)) / (2m))
PAC bound and Bias-Variance tradeoff: for all h, with probability at least 1 − δ:
errortrue(h) ≤ errortrain(h) + √((ln |H| + ln(1/δ)) / (2m))
The errortrain(h) term plays the role of "bias"; the square-root term plays the role of "variance".
• For large |H|
 – low bias (assuming we can find a good h)
 – high variance (because the bound is looser)
• For small |H|
 – high bias (is there a good h?)
 – low variance (tighter bound)
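A sketch of this bound as a function, with illustrative inputs (names and numbers are mine, not from the slides):

```python
import math

def pac_error_bound(train_error, H_size, m, delta):
    """errortrue(h) <= errortrain(h) + sqrt((ln|H| + ln(1/delta)) / (2m)), w.p. >= 1 - delta."""
    return train_error + math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

# Larger |H|: a better (lower) training error is achievable, but the bound is looser.
print(pac_error_bound(train_error=0.05, H_size=10**3, m=5_000, delta=0.05))
print(pac_error_bound(train_error=0.01, H_size=10**9, m=5_000, delta=0.05))
```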
PAC bound: How much data?
• Given δ and ε, how big should m be? Requiring |H| e^(−2mε²) ≤ δ and solving for m:
m ≥ (ln |H| + ln(1/δ)) / (2ε²)
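A worked example with illustrative numbers (not from the slides): for |H| = 10^6, ε = 0.1, δ = 0.05,

```latex
m \;\ge\; \frac{\ln 10^6 + \ln 20}{2 \cdot (0.1)^2} \;\approx\; \frac{13.8 + 3.0}{0.02} \;\approx\; 841.
```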
What about the size of the hypothesis space? How large is |H|?
Decision Trees
• Bound the number of decision trees of depth k over data with n features (see the rough counting sketch below)
• Bad!!! Need exponentially many data points (in k)!!!
• But, for m data points, the tree can't get too big…
 – Number of leaves is never more than the number of data points
 – Instead, let's bound the number of decision trees with k leaves
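A rough counting sketch of why depth is bad (an upper-bound argument of mine, not the slide's exact formula): a tree of depth k+1 chooses one of the n features at the root and two subtrees of depth at most k, so

```latex
H_0 = 2, \qquad H_{k+1} \le n \, H_k^2
\;\Rightarrow\; \ln H_k \le (2^k - 1)\ln n + 2^k \ln 2 = O(2^k \ln n),
```

and plugging ln |H| into m ≥ (ln |H| + ln(1/δ)) / ε makes the required m exponential in k, which is why the slides switch to counting trees by their number of leaves.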
Number of decision trees with k leaves
• Hk = number of decision trees with k leaves (base case: a tree that is a single leaf, with 2 possible labels)
• A loose bound on Hk, together with the PAC bound from before (reminder: m ≥ (ln |H| + ln(1/δ)) / ε), gives a sample-complexity bound in terms of k
PAC bound for decision trees with k leaves – Bias-Variance revisited
Bias / variance again
• k << m: high bias, low variance
• k = m: no bias, high variance
• k > m: we would never do this!!!
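An approximate version of the resulting bound (my restatement using ln Hk = O(k ln n): a tree with k leaves has k − 1 internal nodes, each testing one of n features, plus an O(k)-bit tree shape and leaf labels; this is not the slide's exact formula): with probability at least 1 − δ,

```latex
\mathrm{error}_{true}(h) \;\lesssim\; \mathrm{error}_{train}(h) + \sqrt{\frac{O(k \ln n) + \ln\frac{1}{\delta}}{2m}},
```

which is what drives the k << m / k = m / k > m reading above.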
What did we learn from decision trees?
• Bias-Variance tradeoff formalized
• Moral of the story: the complexity of learning is measured not by the size of the hypothesis space, but by the maximum number of points that can be classified consistently
What about continuous hypothesis spaces?
• Continuous hypothesis space:
 – |H| = ∞
 – Infinite variance???
• As with decision trees, we only care about the maximum number of points that can be classified exactly!
How many points can a linear boundary classify exactly? (1-D)
• 2 points: all 4 possible labelings can be separated. Yes!!
• 3 points: of the 8 possible labelings, the alternating ones (e.g., +, −, +) cannot be separated. No…
Shattering and VC Dimension
• A set of points is shattered by a hypothesis space H iff:
 – For all ways of splitting the examples into positive and negative subsets
 – There exists some consistent hypothesis h
• The VC Dimension of H over input space X
 – The size of the largest finite subset of X shattered by H
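A brute-force illustration of shattering for 1-D linear (threshold) classifiers sign(w·x + b); the point sets and helper names are my own examples:

```python
from itertools import product

def predictions(points, w, b):
    """Labels assigned by the 1-D linear classifier sign(w*x + b)."""
    return tuple(1 if w * x + b > 0 else -1 for x in points)

def shattered(points):
    """True iff every +/- labeling of `points` is realized by some sign(w*x + b)."""
    xs = sorted(points)
    # Candidate thresholds: one beyond each extreme and one between each adjacent pair.
    bs = [-(xs[0] - 1), -(xs[-1] + 1)] + [-(a + c) / 2 for a, c in zip(xs, xs[1:])]
    realizable = {predictions(points, w, b) for w in (-1, 1) for b in bs}
    return all(labels in realizable for labels in product((1, -1), repeat=len(points)))

print(shattered([0.0, 1.0]))        # True: any 2 distinct points are shattered in 1-D
print(shattered([0.0, 1.0, 2.0]))   # False: e.g. the labeling (+, -, +) is not separable
```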
How many points can a linear boundary classify exactly? (2-D)
• 3 points (in general position): all 8 possible labelings can be separated. Yes!!
• 4 points: some labelings (e.g., the XOR pattern) cannot be separated. No…
How many points can a linear boundary classify exactly? (d-D)
• A linear classifier w0 + ∑j=1..d wj xj can represent all assignments of possible labels to d+1 points
 – But not d+2!!
 – Bias term w0 required!
 – Rule of thumb: the number of parameters in the model often matches the max number of points
• Question: Can we get a bound on error as a function of the number of points that can be completely labeled?
PAC bound using VC dimension
• VC dimension: the maximum number of training points that can be classified exactly (shattered) by the hypothesis space H!!!
 – Measures the relevant size of the hypothesis space, as with decision trees with k leaves
• Same bias / variance tradeoff as always
 – Now, just a function of VC(H)
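One standard form of this bound (Vapnik's bound, as stated for example in Mitchell's Machine Learning; offered as a stand-in rather than the slide's own formula): with probability at least 1 − δ, for all h ∈ H,

```latex
\mathrm{error}_{true}(h) \;\le\; \mathrm{error}_{train}(h) +
\sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}.
```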
Examples of VC dimension
• Linear classifiers: VC(H) = d + 1, for d features plus constant term b
• Neural networks (we will see this next): VC(H) = #parameters
 – Local minima mean NNs will probably not find the best parameters
• 1-Nearest neighbor: VC(H) = ∞
• SVM with Gaussian kernel: VC(H) = ∞
What you need to know
• Finite hypothesis spaces
 – Derive the results
 – Counting the number of hypotheses
 – Mistakes on training data
• Complexity of the classifier depends on the number of points that can be classified exactly
 – Finite case: decision trees
 – Infinite case: VC dimension
• Bias-Variance tradeoff in learning theory
• Remember: will your algorithm find the best classifier?