Machine Learning & Data Mining, Part 1: The Basics

Jaime Carbonell (with contributions from Tom Mitchell, Sebastian Thrun and Yiming Yang)
Carnegie Mellon University, [email protected]
December 2008 © 2008, Jaime G. Carbonell

December, 2008 © 2008, Jaime G. Carbonell 2

Some Definitions (KBS vs ML)

Knowledge-Based Systems
  Rules, procedures, semantic nets, Horn clauses
  Inference: matching, inheritance, resolution
  Acquisition: manually from human experts

Machine Learning
  Data: tables, relations, attribute lists, ...
  Inference: rules, trees, decision functions, ...
  Acquisition: automated from data

Data Mining
  Machine learning applied to large real problems
  May be augmented with KBS

December, 2008 © 2008, Jaime G. Carbonell 3

Ingredients for Machine Learning

"Historical" data (e.g. DB tables)
  E.g. products (features, marketing, support, ...)
  E.g. competition (products, pricing, customers)
  E.g. customers (demographics, purchases, ...)

Objective function (to be predicted or optimized)
  E.g. maximize revenue per customer
  E.g. minimize manufacturing defects

Scalable machine learning method(s)
  E.g. decision-tree induction, logistic regression
  E.g. "active" learning, clustering

December, 2008 © 2008, Jaime G. Carbonell 4

Sample ML/DM Applications I

Credit Scoring
  Training: past applicant profiles, how much credit given, payback or default
  Input: applicant profile (income, debts, ...)
  Objective: credit score + max amount

Fraud Detection (e.g. credit-card transactions)
  Training: past known legitimate & fraudulent transactions
  Input: proposed transaction (loc, cust, $$, ...)
  Objective: approve/block decision

December, 2008 © 2008, Jaime G. Carbonell 5

Sample ML/DM Applications II

Demographic Segmentation
  Training: past customer profiles (age, gender, education, income, ...) + product preferences
  Input: new product description (features)
  Objective: predict market segment affinity

Marketing/Advertisement Effectiveness
  Training: past advertisement campaigns, demographic targets, product categories
  Input: proposed advertisement campaign
  Objective: project effectiveness (sales increase modulated by marketing cost)

December, 2008 © 2008, Jaime G. Carbonell 6

Sample ML/DM Applications III

Product (or Part) Reliability
  Training: past products/parts + specs at manufacturing + customer usage + maint rec
  Input: new part + expected usage
  Objective: mean time to failure (replacement)

Manufacturing Tolerances
  Training: past product/part manufacturing process, tolerances, inspections, ...
  Input: new part + expected usage
  Objective: optimal manufacturing precision (minimize costs of failure + manufacture)

December, 2008 © 2008, Jaime G. Carbonell 7

Sample ML/DM Applications IV

Mechanical Diagnosis
  Training: past observed symptoms at (or prior to) breakdown + underlying cause
  Input: current symptoms
  Objective: predict cause of failure

Mechanical Repair
  Training: cause of failure + product usage + repair (or PM) effectiveness
  Input: new failure cause + product usage
  Objective: recommended repair (or preventive maintenance operation)

December, 2008 © 2008, Jaime G. Carbonell 8

Sample ML/DM Applications V

Billeting (job assignments)
  Training: employee profiles, position profiles, employee performance in assigned position
  Input: new employee or new position profile
  Objective: predict performance in position

Text Mining & Routing (e.g. customer centers)
  Training: electronic problem reports, customer requests + who should handle them
  Input: new incoming texts
  Objective: assign category + route or reply

December, 2008 © 2008, Jaime G. Carbonell 9

Preparing Historical Data

Extract a DB table with all the needed information
  Select, join, project, aggregate, ...
  Filter out rows with significant missing data
Determine predictor attributes (columns)
  Ask domain expert for relevant attributes, or
  Start with all attributes and automatically sub-select the most predictive ones (feature selection)
Determine the to-be-predicted attribute (column)
  Objective of the DM (number, decision, ...)

December, 2008 © 2008, Jaime G. Carbonell 10

Sample DB Table        [predictor attributes]                     [objective]

Acct.   Income   Job    Tot Num   Max Num   Owns    Credit   Good
numb.   in K/yr  Now?   Delinq    Delinq    home?   years    cust.?
                        accts     cycles
--------------------------------------------------------------------
1001      85      Y        1         1        N        2       Y
1002      60      Y        3         2        Y        5       N
1003       ?      N        0         0        N        2       N
1004      95      Y        1         2        N        9       Y
1005     110      Y        1         6        Y        3       Y
1006      29      Y        2         1        Y        1       N
1007      88      Y        6         4        Y        8       N
1008      80      Y        0         0        Y        0       Y
1009      31      Y        1         1        N        1       Y
1011       ?      Y        ?         0        ?        7       Y
1012      75      ?        2         4        N        2       N
1013      20      N        1         1        N        3       N
1014      65      Y        1         3        Y        1       Y
1015      65      N        1         2        N        8       Y
1016      20      N        0         0        N        0       N
1017      75      Y        1         3        N        2       N
1018      40      N        0         0        Y        1       Y

December, 2008 © 2008, Jaime G. Carbonell 11

Supervised Learning on a DB Table

Given: DB table
  With identified predictor attributes x1, x2, ...
  And objective attribute y
Find: Prediction function
Subject to: Error minimization on data table M
  Least-squares error, or L1-norm, or L∞-norm, ...

  F_k : x_1, \ldots, x_n \rightarrow y, \qquad F_k \in \{F_1, F_2, \ldots, F_m\}

  f_{best} = \arg\min_{f_k \in \{f_1, \ldots, f_m\}} \sum_{i \in \mathrm{Rows}(M)} [\, y_i - f_k(x_i) \,]^2

December, 2008 © 2008, Jaime G. Carbonell 12

Popular Predictor Functions

Linear Discriminators (next slides)
k-Nearest-Neighbors (lecture #2)
Decision Trees (lecture #5)
Linear & Logistic Regression (lecture #4)
Probabilistic Methods (lecture #3)
Neural Networks
  2-layer → logistic regression
  Multi-layer → difficult to scale up
Classification Rule Induction (in a few slides)

December, 2008 © 2008, Jaime G. Carbonell 13

Linear Discriminator Functions

Two-class problem: y ∈ {class 1, class 2}

[Figure: training points of the two classes plotted in the (x1, x2) feature plane]

December, 2008 © 2008, Jaime G. Carbonell 14

Linear Discriminator Functions

Two-class problem: y ∈ {class 1, class 2}

[Figure: the same two-class training data in the (x1, x2) feature plane]

December, 2008 © 2008, Jaime G. Carbonell 15

Linear Discriminator Functions

Two-class problem: y ∈ {class 1, class 2}

[Figure: the two classes in the (x1, x2) plane separated by a linear discriminator]

  y = a_0 + \sum_{i=1}^{n} a_i x_i
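A minimal sketch (not from the slides) of how such a discriminator is applied: evaluate y = a_0 + Σ a_i x_i for a point and classify by its sign. The weights below are illustrative, not learned.

    # Illustrative sketch: classify a point by the sign of a linear discriminator.
    def linear_discriminant(x, a):
        # a[0] is the bias a0; a[1:] are the per-attribute weights a_i
        return a[0] + sum(ai * xi for ai, xi in zip(a[1:], x))

    def classify(x, a):
        return "class 1" if linear_discriminant(x, a) >= 0 else "class 2"

    a = [-1.0, 0.8, 0.5]            # assumed weights for a 2-D (x1, x2) example
    print(classify([2.0, 1.0], a))  # discriminant = 1.1 >= 0, so "class 1"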

December, 2008 © 2008, Jaime G. Carbonell 16

Linear Discriminator Functions

Two class problem:

y={ , }

x1

x2

new

i

n

ii xay

0

December, 2008 © 2008, Jaime G. Carbonell 17

Issues with Linear Discriminators

What is the "best" placement of the discriminator?
  Maximize the margin
  In general → Support Vector Machines
What if there are K classes (K > 2)?
  Must learn K different discriminators
  Each discriminates class k_i vs. all other classes k_{j≠i}
What if the classes are not linearly separable?
  Minimal-error (L1 or L2) placement (regression)
  Give up on linear discriminators (→ other f_k's)

December, 2008 © 2008, Jaime G. Carbonell 18

Maximizing the Margin

Two-class problem: y ∈ {class 1, class 2}

[Figure: maximum-margin linear separator; the margin is the distance from the separator to the closest training points of either class]

December, 2008 © 2008, Jaime G. Carbonell 19

Nearly-Separable Classes

Two-class problem: y ∈ {class 1, class 2}

[Figure: two classes that are not perfectly linearly separable in the (x1, x2) plane]

December, 2008 © 2008, Jaime G. Carbonell 20

Nearly-Separable Classes

Two-class problem: y ∈ {class 1, class 2}

[Figure: the same nearly-separable classes with a linear separator that misclassifies a few points]

December, 2008 © 2008, Jaime G. Carbonell 21

Minimizing Training Error

Optimal placement of the maximum-margin separator
  Quadratic programming (Support Vector Machines)
  Slack variables to accommodate training errors
Minimizing error metrics (formulas below)
  Number of errors
  Magnitude of error
  Squared error
  Chebyshev norm

  L_0(f, X, y) = \frac{1}{n} \sum_{i=1..n} I(f(x_i) \neq y_i)                              (number of errors)

  L_1(f, X, y) = \sum_{i=1..n} |f(x_i) - y_i| \cdot I(f(x_i) \neq y_i)                     (magnitude of error)

  L_2(f, X, y) = \frac{1}{n} \sum_{i=1..n} (f(x_i) - y_i)^2 \cdot I(f(x_i) \neq y_i)       (squared error)

  L_\infty(f, X, y) = \max_{i=1..n} |f(x_i) - y_i| \cdot I(f(x_i) \neq y_i)                (Chebyshev norm)
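A small sketch (assumptions: predictions and true values are numeric lists of equal length; the function name is mine) computing the four error metrics above:

    # Sketch: the four training-error metrics over paired predictions and true values.
    def error_metrics(preds, ys):
        n = len(ys)
        wrong = [(p, y) for p, y in zip(preds, ys) if p != y]
        L0 = len(wrong) / n                                       # fraction of errors
        L1 = sum(abs(p - y) for p, y in wrong)                    # magnitude of error
        L2 = sum((p - y) ** 2 for p, y in wrong) / n              # squared error
        Linf = max((abs(p - y) for p, y in wrong), default=0.0)   # Chebyshev norm
        return L0, L1, L2, Linf

    print(error_metrics([1.0, 2.5, 3.0], [1.0, 2.0, 4.0]))  # approx (0.67, 1.5, 0.42, 1.0)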

December, 2008 © 2008, Jaime G. Carbonell 22

Symbolic Rule Induction

General idea
  Labeled instances are DB tuples
  Rules are generalized tuples
  Generalization occurs at terms in tuples
  Generalize on new E+ not correctly predicted
  Specialize on new E- not correctly predicted
  Ignore E+ or E- that are already predicted correctly (error-driven learning)

December, 2008 © 2008, Jaime G. Carbonell 23

Symbolic Rule Induction (2)

Example term generalizations
  Constant => disjunction
    e.g. if small portion of value set seen
  Constant => least-common-generalizer class
    e.g. if large portion of value set seen
  Number (or ordinal) => range
    e.g. if dense sequential sampling

Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F        99   -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F        98   +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F        98   +       .00     USA  rash    *none*
81   M        98   -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep

14   F       101   +       .33     USA  normal  ?
67   M       102   +       .77     BRA  rash    ?

Symbolic Rule Induction Example (2)

Candidate Rules:

IF    age    = [12, 65]
      gender = *any*
      temp   = [100, 103]
      b-cult = +
      c-cult = [.00, .23]
      loc    = *any*
      skin   = (normal, flush)
THEN: strep

IF    age    = (15, 65)
      gender = *any*
      temp   = [101, 102]
      b-cult = *any*
      c-cult = [.66, .78]
      loc    = BRA
      skin   = rash
THEN: dengue

Disclaimer: These are not real medical records or rules

December, 2008 © 2008, Jaime G. Carbonell 26

Types of Data Mining

"Supervised" Methods (this DM course)
  Training data has both predictor attributes & objective (to-be-predicted) attributes
  Predict discrete classes → classification
  Predict continuous values → regression
  Duality: classification ↔ regression

"Unsupervised" Methods
  Training data without objective attributes
  Goal: find novel & interesting patterns
  Cutting-edge research, fewer success stories
  Semi-supervised methods: market-basket, ...

December, 2008 © 2008, Jaime G. Carbonell 27

Machine Learning Application Process in a Nutshell

Choose a problem where
  Prediction is valuable and non-trivial
  Sufficient historical data is available
  The objective is measurable (incl. in past data)
Prepare the data
  Tabular form, clean, divide into training & test sets
Select a Machine Learning algorithm
  Human-readable decision fn → rules, trees, ...
  Robust with noisy data → kNN, logistic regression, ...

December, 2008 © 2008, Jaime G. Carbonell 28

Machine Learning Application Process in a Nutshell (2)

Train the ML Algorithm on the Training Data Set
  Each ML method has a different training process
  Training uses both predictor & objective attributes
Run the trained ML Algorithm on the Test Data Set
  Testing uses only predictor attributes & outputs predictions on the objective attributes
  Compare predictions vs. actual objective attributes (see lecture 2 for evaluation metrics)
If Accuracy >= threshold, done.
  Else, try a different ML algorithm, different parameter settings, get more training data, ...

December, 2008 © 2008, Jaime G. Carbonell 29

Sample DB Table (same)   [predictor attributes]                   [objective]

Acct.   Income   Job    Tot Num   Max Num   Owns    Credit   Good
numb.   in K/yr  Now?   Delinq    Delinq    home?   years    cust.?
                        accts     cycles
--------------------------------------------------------------------
1001      85      Y        1         1        N        2       Y
1002      60      Y        3         2        Y        5       N
1003       ?      N        0         0        N        2       N
1004      95      Y        1         2        N        9       Y
1005     100      Y        1         6        Y        3       Y
1006      29      Y        2         1        Y        1       N
1007      88      Y        6         4        Y        8       N
1008      80      Y        0         0        Y        0       Y
1009      31      Y        1         1        N        1       Y
1011       ?      Y        ?         0        ?        7       Y
1012      75      ?        2         4        N        2       N
1013      20      N        1         1        N        3       N
1014      65      Y        1         3        Y        1       Y
1015      65      N        1         2        N        8       Y
1016      20      N        0         0        N        0       N
1017      75      Y        1         3        N        2       N
1018      40      N        0         0        Y       10       Y

December, 2008 © 2008, Jaime G. Carbonell 30

Feature Vector Representation

Predictor-attribute rows in DB tables can be represented as vectors. For instance, the 2nd & 4th rows of predictor attributes in our DB table are:

R2 = [60 Y 3 2 Y 5]

R4 = [95 Y 1 2 N 9]

Converting to numbers (Y = 1, N = 0), we get:

R2 = [60 1 3 2 1 5]

R4 = [95 1 1 2 0 9]

December, 2008 © 2008, Jaime G. Carbonell 31

Vector Similarity

Suppose we have a new credit applicant:

R-new = [65 1 1 2 0 10]

To which of R2 or R4 is she closer?

R2 = [60 1 3 2 1 5]

R4 = [95 1 1 2 0 9]

What should we use as a SIMILARITY METRIC?
Should we first NORMALIZE the vectors?
  If not, the largest component will dominate.

December, 2008 © 2008, Jaime G. Carbonell 32

Normalizing Vector Attributes

Linear Normalization (often sufficient)
  Find max & min values for each attribute
  Normalize each attribute by:

    A_{norm} = (A_{actual} - A_{min}) / (A_{max} - A_{min})

  Apply to all vectors (historical + new), normalizing each attribute, e.g. the income attribute of R2:

    A_1(R2) = (60 - 20) / (100 - 20) = 0.5
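A short sketch of this linear (min-max) normalization applied column-wise to a list of numeric vectors (the function name is mine; Y/N attributes are assumed to have been converted to 1/0 first, as on the earlier slide):

    # Sketch: A_norm = (A_actual - A_min) / (A_max - A_min), per attribute (column).
    def minmax_normalize(vectors):
        cols = list(zip(*vectors))
        mins = [min(c) for c in cols]
        maxs = [max(c) for c in cols]
        return [[(v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, lo, hi in zip(row, mins, maxs)]
                for row in vectors]

    # The slide's income example: (60 - 20) / (100 - 20) = 0.5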

December, 2008 © 2008, Jaime G. Carbonell 33

Normalizing Full Vectors

Normalizing the new applicant vector:
  R-new = [65 1 1 2 0 10]  →  [.56 1 .17 .33 0 1]
And normalizing the two past customer vectors:
  R2 = [60 1 3 2 1 5]  →  [.50 1 .50 .33 1 .50]
  R4 = [95 1 1 2 0 9]  →  [.94 1 .17 .33 0 .90]

What if some attributes are known to be more important, say salary (A1) & delinquencies (A3)?
  Weight accordingly, e.g. ×2 for each
  E.g., R-new-weighted: [1.12 1 .34 .33 0 1]

December, 2008 © 2008, Jaime G. Carbonell 34

Similarity Functions (inverse distance)

Now that we have weighted, normalized vectors, how do we tell exactly their degree of similarity?

Inverse sum of differences (L1):

  sim_{inv-diff}(a, b) = 1 \;/\; \sum_{i=1..n} |a_i - b_i|

Inverse Euclidean distance (L2):

  sim_{Euclid}(a, b) = 1 \;/\; \sqrt{ \sum_{i=1..n} (a_i - b_i)^2 }

December, 2008 © 2008, Jaime G. Carbonell 35

Similarity Functions (direct)

Dot-Product Similarity:

  sim_{dot}(a, b) = a \cdot b = \sum_{i=1..n} a_i b_i

Cosine Similarity (dot product of unit vectors):

  sim_{cos}(a, b) = \frac{\sum_{i=1..n} a_i b_i}{\sqrt{\sum_{i=1..n} a_i^2} \; \sqrt{\sum_{i=1..n} b_i^2}}

December, 2008 © 2008, Jaime G. Carbonell 36

Alternative: Similarity Matrix for Non-Numeric Attributes

          tiny   little  small  medium  large  huge
  tiny    1.0    0.8     0.7    0.5     0.2    0.0
  little         1.0     0.9    0.7     0.3    0.1
  small                  1.0    0.7     0.3    0.2
  medium                        1.0     0.5    0.3
  large                                 1.0    0.8
  huge                                         1.0

Diagonal must be 1.0
Monotonicity property must hold
Triangle inequality must hold
Transitive property must hold
Additivity/Compositionality need not hold

December, 2008 © 2008, Jaime G. Carbonell 37

k-Nearest Neighbors Method

No explicit "training" phase
When a new case arrives (a vector of predictor attributes):
  Find the nearest k neighbors (max similarity) among previous cases (row vectors in the DB table)
  The k neighbors vote for the objective attribute
    Unweighted majority vote, or
    Similarity-weighted vote
Works for both discrete and continuous objective attributes

December, 2008 © 2008, Jaime G. Carbonell 38

Similarity-Weighted Voting in kNN

If the Objective Attribute is Discrete:

  Value(y) = \arg\max_{C \in \mathrm{ValueRange}(obj)} \; \sum_{x_j \in kNN(y) \,\wedge\, value_{obj}(x_j) = C} sim(x_j, y)

If the Objective Attribute is Continuous:

  Value_{obj}(y) = \frac{\sum_{x_j \in kNN(y)} sim(x_j, y) \cdot value_{obj}(x_j)}{\sum_{x_j \in kNN(y)} sim(x_j, y)}
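A compact sketch of kNN with similarity-weighted voting for a discrete objective attribute (names are mine; `sim` can be any of the similarity functions sketched earlier):

    from collections import defaultdict

    # cases: list of (vector, label) pairs from the historical DB table.
    def knn_predict(x_new, cases, sim, k=3):
        neighbors = sorted(cases, key=lambda c: sim(x_new, c[0]), reverse=True)[:k]
        votes = defaultdict(float)
        for vec, label in neighbors:
            votes[label] += sim(x_new, vec)      # similarity-weighted vote
        return max(votes, key=votes.get)

For a continuous objective attribute, return the similarity-weighted average of the neighbors' values instead (sum of sim × value divided by sum of sim), per the second formula above.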

December, 2008 © 2008, Jaime G. Carbonell 39

Applying kNN to Real Problems 1

How does one choose the vector representation?
  Easy: vector = predictor attributes
  What if attributes are not numerical?
    Convert (e.g. High=2, Med=1, Low=0), or
    Use a similarity function over nominal values, e.g. equality or edit distance on strings
How does one choose a distance function?
  Hard: no magic recipe; try simpler ones first
  This implies a need for systematic testing (discussed in coming slides)

December, 2008 © 2008, Jaime G. Carbonell 40

Applying kNN to Real Problems 2

How does one determine whether data should be normalized?
  Normalization is usually a good idea
  One can try kNN both ways to make sure

How does one determine "k" in kNN?
  k is often determined empirically
  A good start is:

    k \approx \log_2(\mathrm{size}(DB))

December, 2008 © 2008, Jaime G. Carbonell 41

Evaluating Machine Learning

Accuracy = Correct-Predictions / Total-Predictions
  Simplest & most popular metric
  But misleading on very-rare-event prediction
Precision, recall & F1
  Borrowed from Information Retrieval
  Applicable to very-rare-event prediction
Correlation (between predicted & actual values)
  For continuous objective attributes
  R^2, kappa coefficient, ...

December, 2008 © 2008, Jaime G. Carbonell 42

Sample Confusion Matrix

(rows = Predicted Diagnoses, columns = True Diagnoses)

                     Shorted     Loose       Burnt      Not
                     Power Sup   Connect's   Resistor   plugged in
  Shorted Power Sup     50          0          10          0
  Loose Connect's        1        120           0         12
  Burnt Resistor        12          0          60          0
  Not plugged in         0          8           5        110

December, 2008 © 2008, Jaime G. Carbonell 43

Measuring Accuracy

Accuracy = correct / total
Error = incorrect / total
Hence: accuracy = 1 - error

  A = \mathrm{Trace}(C) / \mathrm{Full}(C) = \sum_{i=1..n} c_{i,i} \;/\; \sum_{i=1..n} \sum_{j=1..n} c_{i,j}

For the diagnosis example: A = 340/386 = 0.88, E = 1 - A = 0.12

December, 2008 © 2008, Jaime G. Carbonell 44

What About Rare Events?

(rows = Predicted Diagnoses, columns = True Diagnoses)

                     Shorted     Loose       Burnt      Not
                     Power Sup   Connect's   Resistor   plugged in
  Shorted Power Sup      0          0          10          0
  Loose Connect's        1        120           0         12
  Burnt Resistor        12          0          60          0
  Not plugged in         0          8           5        160

December, 2008 © 2008, Jaime G. Carbonell 45

Rare Event Evaluation

Accuracy for this example = 0.88
  ...but NO correct predictions for "shorted power supply", 1 of the 4 diagnoses
Alternative: per-diagnosis (per-class) accuracy:

  A(class_i) = c_{i,i} \;/\; \left( \sum_{j=1..n} c_{i,j} + \sum_{j=1..n} c_{j,i} - c_{i,i} \right)

  A("shorted PS") = 0/22 = 0
  A("not plugged in") = 160/184 = 0.87
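A sketch (function names are mine) computing overall accuracy and the per-class accuracy above from a confusion matrix C laid out as predicted rows by true columns:

    # C[i][j] = count of cases predicted as class i whose true class is j.
    def accuracy(C):
        total = sum(sum(row) for row in C)
        return sum(C[i][i] for i in range(len(C))) / total       # trace / full sum

    def per_class_accuracy(C, i):
        row = sum(C[i])                              # everything predicted as class i
        col = sum(C[j][i] for j in range(len(C)))    # everything truly class i
        return C[i][i] / (row + col - C[i][i])

    C = [[0, 0, 10, 0], [1, 120, 0, 12], [12, 0, 60, 0], [0, 8, 5, 160]]
    print(accuracy(C), per_class_accuracy(C, 0))     # shorted-power-supply class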

December, 2008 © 2008, Jaime G. Carbonell 46

ROC Curves (ROC=Receiver Operating Characteristic)

December, 2008 © 2008, Jaime G. Carbonell 47

ROC Curves (ROC=Receiver Operating Characteristic)

Sensitivity = TP/(TP+FN)

Specificity = TN/(TN+FP)
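A tiny sketch of these two rates from binary counts (an ROC curve plots sensitivity against 1 - specificity as the classifier's decision threshold is varied):

    def sensitivity(tp, fn):          # true-positive rate (recall)
        return tp / (tp + fn)

    def specificity(tn, fp):          # true-negative rate
        return tn / (tn + fp)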

December, 2008 © 2008, Jaime G. Carbonell 48

If Plenty of Data Is Available, Evaluate with a Holdout Set

[Figure: the data is split into a training portion and a held-out evaluation portion; train on one, measure error on the other]

Often also used for parameter optimization

December, 2008 © 2008, Jaime G. Carbonell 49

Finite Cross-Validation Set

True error (true risk):

  e_D = \int_D \delta(f(x) \neq y) \, p(x, y) \, dx \, dy

Test error (empirical risk):

  \hat{e}_S = \frac{1}{m} \sum_{(x,y) \in S} \delta(f(x) \neq y)

where D = all data, S = test data, m = # of test samples

December, 2008 © 2008, Jaime G. Carbonell 50

Confidence Intervals

If S contains m examples, drawn independently, m >= 30
Then, with approximately 95% probability, the true error e_D lies in the interval

  \hat{e}_S \pm 1.96 \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m }

December, 2008 © 2008, Jaime G. Carbonell 51

Example: A hypothesis misclassifies 12 out of 40 examples in the cross-validation set S.
Q: What will the "true" error be on future examples?
A: With 95% confidence, the true error will be in the interval [0.16; 0.44]:

  \hat{e}_S = 12/40 = 0.3, \qquad m = 40

  1.96 \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m } = 0.14

  \hat{e}_S \pm 1.96 \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m } = [0.16; 0.44]
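A sketch of this interval computation (function name is mine):

    import math

    # 95% interval: e_S +/- 1.96 * sqrt(e_S * (1 - e_S) / m)
    def error_confidence_interval(num_errors, m, z=1.96):
        e_s = num_errors / m
        half = z * math.sqrt(e_s * (1 - e_s) / m)
        return e_s - half, e_s + half

    print(error_confidence_interval(12, 40))   # approximately (0.16, 0.44)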

Confidence Intervals

If S contains m examples, drawn independently, m >= 30
Then, with approximately N% probability, the true error e_D lies in the interval

  \hat{e}_S \pm z_N \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m }

  N%:    50%    68%    80%    90%    95%    98%    99%
  z_N:   0.67   1.00   1.28   1.64   1.96   2.33   2.58

December, 2008 © 2008, Jaime G. Carbonell 53

Finite Cross-Validation Set

True error:

  e_D = \int_D \delta(f(x) \neq y) \, p(x, y) \, dx \, dy

Test error:

  \hat{e}_S = \frac{1}{m} \sum_{(x,y) \in S} \delta(f(x) \neq y)

The number of test errors is binomially distributed:

  P\left( \sum_{(x,y) \in S} \delta(f(x) \neq y) = k \right) = \frac{m!}{k!\,(m-k)!} \, e_D^{\,k} (1 - e_D)^{m-k}

k-fold Cross Validation

[Figure: the data is split k ways; in each round the model is trained on k-1 folds (yellow) and evaluated on the remaining fold (pink), yielding error_1, ..., error_k]

  error = \frac{1}{k} \sum_i error_i

December, 2008 © 2008, Jaime G. Carbonell 55

Cross Validation Procedure

Purpose: evaluate DM accuracy on training data
Experiment: try different similarity functions, etc.
Process:
  Divide the training data into k equal pieces (each piece is called a "fold")
  Train the classifier using all but the k-th fold
  Test for accuracy on the k-th fold
  Repeat with the (k-1)-th fold held out for testing, then the (k-2)-th fold, until tested on all folds
  Report the average accuracy across folds
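A sketch of the procedure (assumptions: `train` fits a model on a list of rows and `accuracy` scores a model on held-out rows; both are placeholders, not defined here):

    def k_fold_cv(rows, k, train, accuracy):
        folds = [rows[i::k] for i in range(k)]     # k (roughly) equal pieces
        scores = []
        for i in range(k):
            held_out = folds[i]
            training = [r for j, fold in enumerate(folds) if j != i for r in fold]
            model = train(training)
            scores.append(accuracy(model, held_out))
        return sum(scores) / k                     # average accuracy across folds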

The Jackknife

[Figure: jackknife evaluation splits of the data, holding out one part of the data at a time]

December, 2008 © 2008, Jaime G. Carbonell 57

Comparing Different Hypotheses: Paired t Test

True difference:

  d = e_D(h_1) - e_D(h_2)

For each partition k (test error for partition k):

  \hat{d}_k = \hat{e}_{S,k}(h_1) - \hat{e}_{S,k}(h_2)

Average:

  \hat{d} = \frac{1}{k} \sum_{i=1..k} \hat{d}_i

N% confidence interval:

  \hat{d} \pm t_{N,k-1} \sqrt{ \frac{1}{k(k-1)} \sum_{i=1..k} (\hat{d}_i - \hat{d})^2 }

where k-1 is the degrees of freedom and N is the confidence level
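A sketch of the interval above (assumption: d_hats holds the per-fold error differences between the two hypotheses, and t_n is the t-table value for the chosen confidence level with k-1 degrees of freedom, e.g. about 2.26 for 95% with k = 10):

    import math

    def paired_t_interval(d_hats, t_n):
        k = len(d_hats)
        d_bar = sum(d_hats) / k
        s = math.sqrt(sum((d - d_bar) ** 2 for d in d_hats) / (k * (k - 1)))
        return d_bar - t_n * s, d_bar + t_n * s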

December, 2008 © 2008, Jaime G. Carbonell 58

Version Spaces (Mitchell, 1980)

[Figure: version-space lattice of hypotheses, from Specific Instances (bounded below by the S boundary) up to Anything (bounded above by the G boundary); the "target" concept lies between the S and G boundaries]

December, 2008 © 2008, Jaime G. Carbonell 59

Original & Seeded Version Spaces Version-spaces (Mitchell, 1980)

Symbolic multivariate learning S & G sets define lattice boundaries Exponential worst-case: O(bN)

Seeded Version Spaces (Carbonell, 2002)

Generality level hypothesis seed S & G subsets effective lattice Polynomial worst case: O(bk/2), k=3,4

December, 2008 © 2008, Jaime G. Carbonell 60

Seeded Version Spaces (Carbonell, 2002)

[Figure: seeded version space over translation rules for "The big book" ↔ "el libro grande": constituent patterns X_n ↔ Y_m such as Det Adj N ↔ Det N Adj, with agreement constraints (Y2 num) = (Y3 num), (Y2 gen) = (Y3 gen), (X3 num) = (Y2 num); the "target" concept lies between the S and G boundaries]

December, 2008 © 2008, Jaime G. Carbonell 61

Seeded Version Spaces

[Figure: the seed (best-guess hypothesis) is placed between the S and G boundaries near the "target" concept for the same "The big book" ↔ "el libro grande" example, so only a lattice of depth k (rather than N) need be searched]

December, 2008 © 2008, Jaime G. Carbonell 62

Naïve Bayes Classification

Some Notation:

  Training instance index i = 1, 2, ..., I
  Term index j = 1, 2, ..., J
  Category index k = 1, 2, ..., K
  Training data D^(k) = ((x_i, y_i^(k)))
  Instance feature vector x_i = (1, n_{i1}, n_{i2}, ..., n_{iJ})
  Output labels y_i = (y_i^(1), y_i^(2), ..., y_i^(K)), with y_i^(k) = 1 or 0

December, 2008 © 2008, Jaime G. Carbonell 63

Bayes Classifier

Assigning the most probable category to x:

  \hat{c} = \arg\max_k P(c_k \mid x)
          = \arg\max_k P(c_k) P(x \mid c_k) / P(x)                      (Bayes Rule)
          = \arg\max_k P(c_k) P(x \mid c_k)
          = \arg\max_k [ \log P(c_k) + \log P(x \mid c_k) ]

  \hat{P}(c_k) = (\# of training instances in c_k) / I                  (MLE)

  \hat{P}(x_i \mid c_k) = \hat{P}(n_{i1}, \ldots, n_{iJ} \mid c_k) = ?   (multinomial distribution, next slides)

December, 2008 © 2008, Jaime G. Carbonell 64

Maximum Likelihood Estimate (MLE)

n: # of objects in a random sample from a population
m: # of instances of a category among the n-object sample
p: true probability of any object belonging to the category

Likelihood of observing the data given model p (assuming i.i.d. draws Y_i ~ Ber(p), Y_i ∈ {0, 1}):

  L(D \mid p) = P(D \mid p) = P(Y_1, \ldots, Y_n \mid p) = \prod_i P(Y_i \mid p) = p^m (1 - p)^{n - m}

  f(p) = \log P(D \mid p) = m \log p + (n - m) \log(1 - p)

Setting the derivative of f(p) to zero yields:

  \frac{d}{dp} f(p) = \frac{m}{p} - \frac{n - m}{1 - p} = 0
  \;\Rightarrow\; m (1 - p) = (n - m) p \;\Rightarrow\; p = m / n

December, 2008 © 2008, Jaime G. Carbonell 65

Binomial Distribution

Consider coin toss as a Bernoulli process, X ~ Ber(p)

  P(Head) = p, \qquad P(Tail) = 1 - p = q

What is the probability of seeing 2 heads out of 5 tosses?

  P(\# \text{ of heads} = 2 \mid n = 5) = \binom{5}{2} p^2 q^3 = \frac{5!}{2!\,3!} p^2 q^3

Observing k heads in n tosses follows a binomial distribution:

  Y = \sum_i X_i, \qquad Y \sim Bin(n, p), \qquad P(Y = k) = \binom{n}{k} p^k (1 - p)^{n - k}

December, 2008 © 2008, Jaime G. Carbonell 66

Multinomial Distribution

Consider tossing a 6-faced die n times with probabilities p_1, p_2, ..., p_6, where the probabilities sum to 1.

  P(X_1 = n_1, \ldots, X_6 = n_6) = \frac{n!}{n_1! \cdots n_6!} \, p_1^{n_1} \cdots p_6^{n_6}

Taking the count of each observed face as a random variable, we have a multinomial process defined as

  (X_1, X_2, \ldots, X_6) \sim Mul(n, p_1, \ldots, p_6), \qquad X_j \geq 0, \qquad \sum_{j=1}^{6} X_j = n

December, 2008 © 2008, Jaime G. Carbonell 67

Multinomial NB

The conditional probability is

  P(x \mid c) = P(n_{x1}, \ldots, n_{xJ} \mid c) = \frac{n_x!}{n_{x1}!\, n_{x2}! \cdots n_{xJ}!} \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}}        (t_j is a term)

We can remove the first term (the multinomial coefficient, which does not depend on c) from the objective function:

  P(x \mid c) \propto \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}}, \qquad \log P(x \mid c) = \sum_{j=1}^{J} n_{xj} \log P(t_j \mid c) + \mathrm{const}
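A compact sketch of a multinomial NB text classifier with Laplace smoothing (see the next slide); documents are lists of terms paired with a category, and all names are mine:

    import math
    from collections import Counter, defaultdict

    def train_nb(docs):                         # docs: list of (terms, category)
        class_counts = Counter(c for _, c in docs)
        term_counts = defaultdict(Counter)
        for terms, c in docs:
            term_counts[c].update(terms)
        vocab = {t for terms, _ in docs for t in terms}
        return class_counts, term_counts, vocab, len(docs)

    def classify_nb(terms, model):
        class_counts, term_counts, vocab, n_docs = model
        best, best_score = None, float("-inf")
        for c, n_c in class_counts.items():
            score = math.log(n_c / n_docs)                            # log prior
            total = sum(term_counts[c].values())
            for t in terms:                                           # log P(t|c) per occurrence
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = c, score
        return best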

December, 2008 © 2008, Jaime G. Carbonell 68

Smoothing Methods

Laplace Smoothing (common):

  \tilde{P}(t \mid c) = (1 + n_{t|c}) \;/\; \left( |V| + \sum_{t' \in V} n_{t'|c} \right)

Two-state Hidden Markov Model (BBN, or Jelinek-Mercer Interpolation):

  \tilde{P}(t \mid c) = \lambda P(t \mid c) + (1 - \lambda) P(t)

Hierarchical Smoothing (McCallum, ICML'98):

  \tilde{P}(t \mid c) = \lambda_1 P(t \mid c_1) + \lambda_2 P(t \mid c_2) + \cdots + \lambda_h P(t \mid c_h)

The lambdas (summing to 1) are the mixture weights, obtained by running an EM algorithm on a validation set.

December, 2008 © 2008, Jaime G. Carbonell 69

Basic Assumptions

Term independence:

  P(x_i \mid c_k) = P(t_1 \mid c_k)^{n_{i1}} \cdot P(t_2 \mid c_k)^{n_{i2}} \cdots

Expecting one objective attribute y per instance:

  \sum_k P(c_k) = 1

  \arg\max_k P(d \mid c_k) = \arg\max_k P(n_{d,1}, n_{d,2}, \ldots, n_{d,|V|} \mid c_k)
                           = \arg\max_k \frac{n_d!}{n_{d,1}! \cdots n_{d,|V|}!} \prod_{t \in V} P(t \mid c_k)^{n_{d,t}}

Continuity of instances in the same class (one mode per class)

December, 2008 © 2008, Jaime G. Carbonell 70

NB and Cross Entropy

Entropy
  Measures the uncertainty: lower entropy means easier predictions
  Minimum coding length if distribution p is known

    H(p) = -\sum_{k=1}^{K} p_k \log p_k, \qquad \sum_k p_k = 1, \qquad p = (p_1, \ldots, p_K)

Cross Entropy
  Measures the coding length (in # of bits) based on distribution q when the true distribution is p

    H(p, q) = -\sum_k p_k \log q_k
            = -\sum_k p_k \log p_k + \sum_k p_k \log(p_k / q_k)
            = H(p) + D(p \| q)

December, 2008 © 2008, Jaime G. Carbonell 71

Kullback-Leibler (KL) Divergence

Also called "Relative Entropy"
Measures the difference between two distributions:

  D(p \| q) = \sum_k p_k \log(p_k / q_k)

Zero-valued if p = q
Not symmetric: p and q are not interchangeable, i.e. D(p \| q) ≠ D(q \| p) in general


December, 2008 © 2008, Jaime G. Carbonell 72

NB & Cross Entropy (cont’d)

  k^* = \arg\max_k [ \log P(c_k) + \sum_{t_j \in x_i} n_{ij} \log P(t_j \mid c_k) ]
      = \arg\max_k [ \log P(c_k) + n_i \sum_{t_j \in x_i} (n_{ij} / n_i) \log P(t_j \mid c_k) ]
      = \arg\max_k [ \log P(c_k) + n_i \sum_{t_j \in x_i} \hat{P}(t_j \mid x_i) \log P(t_j \mid c_k) ]
      = \arg\max_k [ \log P(c_k) - n_i H(p_{x_i}, q_{c_k}) ]
      = \arg\min_k [ -\log P(c_k) + n_i H(p_{x_i}, q_{c_k}) ]

Minimum Description Length (MDL) Classifier

December, 2008 © 2008, Jaime G. Carbonell 73

Concluding Remarks on NB

Pros
  Explicit probabilistic reasoning
  Relatively effective, fast online response (as an eager learner)

Cons
  The scoring function (logarithm of term probabilities) can be too sensitive to measurement errors on rare features
  The one-class-per-instance assumption imposes both theoretical and practical limitations
  Empirically weak when dealing with rare categories and large feature sets

December, 2008 © 2008, Jaime G. Carbonell 74

Statistical Decision Theory

Random input X in R^J
Random output Y in {1, 2, ..., K}
Prediction f(X) in {1, 2, ..., K}
Loss function (0-1 loss for classification):
  L(y(x), f(x)) = 0 iff f(x) = y(x)
  L(y(x), f(x)) = 1 otherwise

Expected Prediction Error (EPE):

  EPE = E[ L(Y, f(X)) ] = E_X \sum_{k=1}^{K} L(k, f(X)) \, P(k \mid X)

Minimizing EPE pointwise:

  \hat{f}(x) = \arg\min_{f(x) \in \{1,\ldots,K\}} \sum_{k=1}^{K} L(k, f(x)) \, P(k \mid x)
             = \arg\min_{k \in \{1,\ldots,K\}} [ 1 - P(k \mid x) ]
             = \arg\max_{k \in \{1,\ldots,K\}} P(k \mid x)

December, 2008 © 2008, Jaime G. Carbonell 75

Selection of ML Algorithm (I)

Method            Training Data     Random Noise    Scalability
                  Requirements      Tolerance       (atts + data)

Rule Induction    Sparse            None            Good
Decision Trees    Sparse-Dense      Some            Excellent
Naïve Bayes       Medium-Dense      Some-Good       Medium
Regression        Medium-Dense      Some-Good       Good
kNN               Sparse-Dense      Some-Good       Good-Excellent
SVM               Medium-Dense      Some-Good       Good-Excellent
Neural Nets       Dense             Good            Poor-Medium

December, 2008 © 2008, Jaime G. Carbonell 76

Selection of ML Algorithm (II)

Method            Quality of        Explanatory     Popularity
                  Prediction        Power           of Usage

Rule Induction    Good, brittle     Very clear      Med, declining
Decision Trees    Good/category     Very clear      High, stable
Naïve Bayes       Medium/cat        Partial         Med, declining
Regression        Good/both         Partial-Poor    High, stable
kNN               Good/both         Partial-Good    Med, increasing
SVM               Very good/cat     Poor            Med, increasing
Neural Nets       Good/cat          Poor            High, declining