Bayesian Learning, Part 1 of (probably) 4
Reading: Bishop Ch. 1.2, 1.5, 2.3


Page 1: Bayesian Learning, Part 1 of (probably) 4

Bayesian Learning, Part 1 of (probably) 4

Reading: Bishop Ch. 1.2, 1.5, 2.3

Page 2: Bayesian Learning, Part 1 of (probably) 4

Administrivia
•Office hours tomorrow moved:

•noon-2:00

•Thesis defense announcement:

•Sergey Plis, Improving the information derived from human brain mapping experiments.

•Application of ML/statistical techniques to analysis of MEG neuroimaging data

•Feb 21, 9:00-11:00 AM

•FEC 141; everybody welcome

Page 3: Bayesian Learning, Part 1 of (probably) 4

Yesterday, today, and...
•Last time:

•Finish up SVMs

•This time:

•HW3

• Intro to statistical/generative modeling

•Statistical decision theory

•The Bayesian viewpoint

•Discussion of R1

Page 4: Bayesian Learning, Part 1 of (probably) 4

Homework (proj) 3
•Data sets:

•MNIST Database of handwritten digits: http://yann.lecun.com/exdb/mnist/

•One other (forthcoming)

•Algorithms:

•Decision tree: http://www.cs.waikato.ac.nz/ml/weka/

•Linear LSE classifier (roll your own)

•SVM (ditto, and compare to Weka’s)

•Gaussian kernel; poly degree 4, 10, 20; sigmoid

•Question: which algorithm is better on these data sets? Why? Prove it.

Page 5: Bayesian Learning, Part 1 of (probably) 4

HW 3 additional details
•Due: Tues Mar 6, 2007, beginning of class

•2 weeks from today -- many office hours between now and then

•Feel free to talk to each other, but write your own code

•Must code LSE, SVM yourself; can use pre-packaged DT

•Use a QP library/solver for SVM (e.g., Matlab’s quadprog() function)

•Hint: QPs are sloooow for large data; probably want to sub-sample the data set (a sketch follows after this list)

•Q: what effect does this have?

•Extra credit: roll your own DT
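A minimal sketch of the sub-sampling hint above, in Python/NumPy; the function name and the 500-per-class figure are my own choices, not part of the assignment:

```python
import numpy as np

def subsample_per_class(X, y, n_per_class=500, seed=0):
    """Draw a class-balanced random subset so the SVM's QP stays tractable."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)          # indices of all examples of class c
        take = min(n_per_class, idx.size)
        keep.extend(rng.choice(idx, size=take, replace=False))
    keep = np.asarray(keep)
    return X[keep], y[keep]

# e.g., X_small, y_small = subsample_per_class(X_train, y_train)
```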

Page 6: Bayesian Learning, Part 1 of (probably) 4

ML trivia of the day...
•Which data mining techniques have you used in a successfully deployed application?

http://www.kdnuggets.com/

Page 7: Bayesian Learning, Part 1 of (probably) 4

Assumptions
•“Assume makes an a** out of U and ME”...

•Bull****

•Assumptions are unavoidable

• It is not possible to have an assumption-free learning algorithm

•Must always have some assumption about how the data works

•Makes learning faster, more accurate, more robust

Page 8: Bayesian Learning, Part 1 of (probably) 4

Example assumptions

•Decision tree:

•Axis orthogonality

• Impurity-based splitting

•Greedy search ok

•Accuracy (0/1 loss) objective function

Page 9: Bayesian Learning, Part 1 of (probably) 4

Example assumptions

•Linear discriminant (hyperplane classifier) via MSE:

•Data is linearly separable

•Squared-error cost
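For concreteness (and since HW3 asks you to roll your own LSE classifier), here is a minimal two-class sketch of the MSE hyperplane in Python/NumPy; the ±1 label encoding and the appended bias column are my assumptions, not something fixed by the slide:

```python
import numpy as np

def fit_lse(X, y):
    """Least-squares hyperplane: labels in {-1, +1}, minimize ||A w - y||^2."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)      # closed-form squared-error fit
    return w

def predict_lse(w, X):
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(A @ w)                          # which side of the hyperplane
```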

Page 10: Bayesian Learning, Part 1 of (probably) 4

Example assumptions

•Support vector machines

•Data is (close to) linearly separable...

• ... in some high-dimensional projection of input space

• Interesting nonlinearities can be captured by kernel functions

•Max margin objective function

Page 11: Bayesian Learning, Part 1 of (probably) 4

Specifying assumptions
•Bayesian learning assumes:

•Data were generated by some stochastic process

•Can write down (some) mathematical form for that process

•CDF/PDF/PMF

•Mathematical form needs to be parameterized

•Have some “prior beliefs” about those params
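As one illustration of what “writing the assumptions down” can look like (the concrete choice of Gaussian class-conditionals here is mine, anticipating the height/weight example below):

```latex
% A parameterized generative story plus prior beliefs about the parameters:
\begin{align*}
  \omega &\sim \mathrm{Categorical}(\pi_1,\dots,\pi_K)  && \text{class label}\\
  x \mid \omega = i &\sim \mathcal{N}(\mu_i, \Sigma_i)  && \text{class-conditional PDF}\\
  (\mu_i, \Sigma_i) &\sim p(\mu_i, \Sigma_i)            && \text{prior beliefs about the params}
\end{align*}
```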

Page 12: Bayesian Learning, Part 1 of (probably) 4

Specifying assumptions
•Makes strong assumptions about form (distribution) of data

•Essentially, an attempt to make assumptions explicit and to divorce them from learning algorithm

• In practice, not a single learning algorithm, but a recipe for generating problem-specific algs.

•Will work well to the extent that these assumptions are right

Page 13: Bayesian Learning, Part 1 of (probably) 4

Example
•F={height, weight}

•Ω={male, female}

•Q1: Any guesses about individual distributions of height/weight by class?

•What probability function (PDF)?

•Q2: What about the joint distribution?

•Q3: What about the means of each?

•Reasonable guess for the upper/lower bounds on the means?

Page 14: Bayesian Learning, Part 1 of (probably) 4

Some actual data*

* Actual synthesized data, anyway...

Page 15: Bayesian Learning, Part 1 of (probably) 4

General idea

•Find probability distribution that describes classes of data

•Find decision surface in terms of those probability distributions
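A minimal sketch of that recipe in Python with SciPy, assuming each class is modeled by a 2-D Gaussian over (height, weight); the function names and the choice of Gaussians are mine:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_gaussians(X, y):
    """Fit one Gaussian per class, plus the class priors P(class)."""
    models, priors = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = multivariate_normal(mean=Xc.mean(axis=0),
                                        cov=np.cov(Xc, rowvar=False))
        priors[c] = Xc.shape[0] / X.shape[0]
    return models, priors

def classify(x, models, priors):
    """Decision rule: pick the class with the largest p(x | class) * P(class)."""
    return max(models, key=lambda c: models[c].pdf(x) * priors[c])
```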

Page 16: Bayesian Learning, Part 1 of (probably) 4

H/W data as PDFs

Page 17: Bayesian Learning, Part 1 of (probably) 4

Or, if you prefer...

Page 18: Bayesian Learning, Part 1 of (probably) 4

General idea

•Find probability distribution that describes classes of data

•Find decision surface in terms of those probability distributions

•What would be a good rule?

Page 19: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math

•Bayesian decision rule: Bayes optimality

•Want to pick the class that minimizes expected cost

•Simplest case: cost==misclassification

•Expected cost == expected misclassification rate

Page 20: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math

•Expectation only defined w.r.t. a probability distribution:

•Posterior probability of class i given data x:

• Interpreted as: the chance that the real class is i, given that the observed data is x
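The formula itself did not survive the export; presumably it is the usual Bayes'-rule form of the posterior:

```latex
% Posterior probability of class \omega_i given the observed data x:
P(\omega_i \mid x) \;=\; \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}
                   \;=\; \frac{p(x \mid \omega_i)\,P(\omega_i)}
                              {\sum_j p(x \mid \omega_j)\,P(\omega_j)}
```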

Page 21: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math
•Expected cost is then:

•cost of getting it wrong * prob of getting it wrong

• integrated over all possible outcomes (true classes)

•More formally:

Cost of classifying a class j thing as a class i
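The equation image is missing here; in standard decision-theoretic notation (the symbol names are my reconstruction), the expected cost of deciding class i given x is:

```latex
% \lambda(i \mid j): cost of classifying a class-j thing as class i.
R(i \mid x) \;=\; \sum_{j} \lambda(i \mid j)\, P(\omega_j \mid x)
```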

Page 22: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math
•Expected cost is then:

•cost of getting it wrong * prob of getting it wrong

• integrated over all possible outcomes (true classes)

•More formally:

•Want to pick the class i that minimizes this

Page 23: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math
•For 0/1 cost, reduces to:

Page 24: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math
•For 0/1 cost, reduces to:

•To minimize, pick the class i that minimizes:
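The formula is again lost; with 0/1 cost (λ(i|j) = 0 if i = j, else 1), the expected cost above reduces to:

```latex
R(i \mid x) \;=\; \sum_{j \neq i} P(\omega_j \mid x) \;=\; 1 - P(\omega_i \mid x),
\qquad\text{so the minimizing choice is } \hat{\imath} = \arg\max_i P(\omega_i \mid x).
```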

Page 25: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math
• In pictures:

Page 26: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math
• In pictures:

Page 27: Bayesian Learning, Part 1 of (probably) 4

5 minutes of math
•These thresholds are called the Bayes decision thresholds

•The corresponding cost (err rate) is called the Bayes optimal cost
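A small 1-D illustration of where such a threshold sits, in Python with SciPy; the two Gaussians (class-conditional heights) and the equal priors are made up for the sketch:

```python
from scipy.stats import norm
from scipy.optimize import brentq

# Invented class-conditional densities over height (cm), equal priors 0.5/0.5.
male, female = norm(178, 7), norm(165, 6)

# Bayes decision threshold: where the prior-weighted densities cross.
threshold = brentq(lambda x: 0.5 * male.pdf(x) - 0.5 * female.pdf(x), 150, 190)

# Bayes optimal error: probability mass each class puts on the wrong side.
bayes_error = 0.5 * male.cdf(threshold) + 0.5 * female.sf(threshold)
print(threshold, bayes_error)
```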

A real-world example: