Machine Learning & Data Mining, Part 1: The Basics

Jaime Carbonell (with contributions from Tom Mitchell, Sebastian Thrun and Yiming Yang)
Carnegie Mellon University, [email protected]
December 2008 © 2008, Jaime G. Carbonell

December, 2008 © 2008, Jaime G. Carbonell 2

Some Definitions (KBS vs ML)

Knowledge-Based Systems
  Rules, procedures, semantic nets, Horn clauses
  Inference: matching, inheritance, resolution
  Acquisition: manually from human experts

Machine Learning
  Data: tables, relations, attribute lists, ...
  Inference: rules, trees, decision functions, ...
  Acquisition: automated from data

Data Mining
  Machine learning applied to large real problems
  May be augmented with KBS

December, 2008 © 2008, Jaime G. Carbonell 3

Ingredients for Machine Learning

"Historical" data (e.g. DB tables)
  E.g. products (features, marketing, support, ...)
  E.g. competition (products, pricing, customers)
  E.g. customers (demographics, purchases, ...)

Objective function (to be predicted or optimized)
  E.g. maximize revenue per customer
  E.g. minimize manufacturing defects

Scalable machine learning method(s)
  E.g. decision-tree induction, logistic regression
  E.g. "active" learning, clustering

December, 2008 © 2008, Jaime G. Carbonell 4

Sample ML/DM Applications I

Credit Scoring
  Training: past applicant profiles, how much credit given, payback or default
  Input: applicant profile (income, debts, ...)
  Objective: credit score + max amount

Fraud Detection (e.g. credit-card transactions)
  Training: past known legitimate & fraudulent transactions
  Input: proposed transaction (loc, cust, $$, ...)
  Objective: approve/block decision

December, 2008 © 2008, Jaime G. Carbonell 5

Sample ML/DM Applications II

Demographic Segmentation
  Training: past customer profiles (age, gender, education, income, ...) + product preferences
  Input: new product description (features)
  Objective: predict market segment affinity

Marketing/Advertisement Effectiveness
  Training: past advertisement campaigns, demographic targets, product categories
  Input: proposed advertisement campaign
  Objective: project effectiveness (sales increase modulated by marketing cost)

December, 2008 © 2008, Jaime G. Carbonell 6

Sample ML/DM Applications III

Product (or Part) Reliability
  Training: past products/parts + specs at manufacturing + customer usage + maint rec
  Input: new part + expected usage
  Objective: mean time to failure (replacement)

Manufacturing Tolerances
  Training: past product/part manufacturing process, tolerances, inspections, ...
  Input: new part + expected usage
  Objective: optimal manufacturing precision (minimize costs of failure + manufacture)

December, 2008 © 2008, Jaime G. Carbonell 7

Sample ML/DM Applications IV

Mechanical Diagnosis
  Training: past observed symptoms at (or prior to) breakdown + underlying cause
  Input: current symptoms
  Objective: predict cause of failure

Mechanical Repair
  Training: cause of failure + product usage + repair (or PM) effectiveness
  Input: new failure cause + product usage
  Objective: recommended repair (or preventive maintenance operation)

December, 2008 © 2008, Jaime G. Carbonell 8

Sample ML/DM Applications V

Billeting (job assignments)
  Training: employee profiles, position profiles, employee performance in assigned position
  Input: new employee or new position profile
  Objective: predict performance in position

Text Mining & Routing (e.g. customer centers)
  Training: electronic problem reports, customer requests + who should handle them
  Input: new incoming texts
  Objective: assign category + route or reply

December, 2008 © 2008, Jaime G. Carbonell 9

Preparing Historical Data

Extract a DB table with all the needed information
  Select, join, project, aggregate, ...
  Filter out rows with significant missing data
Determine predictor attributes (columns)
  Ask domain expert for relevant attributes, or
  Start with all attributes and automatically sub-select the most predictive ones (feature selection)
Determine the to-be-predicted attribute (column)
  Objective of the DM (number, decision, ...)

December, 2008 © 2008, Jaime G. Carbonell 10

Sample DB Table        [predictor attributes]                     [objective]

Acct.   Income   Job    Tot Num   Max Num   Owns    Credit   Good
numb.   in K/yr  Now?   Delinq    Delinq    home?   years    cust.?
                        accts     cycles
--------------------------------------------------------------------
1001      85      Y        1         1        N        2       Y
1002      60      Y        3         2        Y        5       N
1003       ?      N        0         0        N        2       N
1004      95      Y        1         2        N        9       Y
1005     110      Y        1         6        Y        3       Y
1006      29      Y        2         1        Y        1       N
1007      88      Y        6         4        Y        8       N
1008      80      Y        0         0        Y        0       Y
1009      31      Y        1         1        N        1       Y
1011       ?      Y        ?         0        ?        7       Y
1012      75      ?        2         4        N        2       N
1013      20      N        1         1        N        3       N
1014      65      Y        1         3        Y        1       Y
1015      65      N        1         2        N        8       Y
1016      20      N        0         0        N        0       N
1017      75      Y        1         3        N        2       N
1018      40      N        0         0        Y        1       Y

December, 2008 © 2008, Jaime G. Carbonell 11

Supervised Learning on a DB Table

Given: DB table
  With identified predictor attributes x1, x2, ...
  And objective attribute y
Find: Prediction function
Subject to: Error minimization on data table M
  Least-squares error, or L1-norm, or L∞-norm, ...

  F_k : x_1, \ldots, x_n \rightarrow y, \qquad F_k \in \{F_1, F_2, \ldots, F_m\}

  f_{best} = \arg\min_{f_k \in \{f_1, \ldots, f_m\}} \sum_{i \in \mathrm{Rows}(M)} [\, y_i - f_k(x_i) \,]^2

December, 2008 © 2008, Jaime G. Carbonell 12

Popular Predictor Functions

Linear Discriminators (next slides)
k-Nearest-Neighbors (lecture #2)
Decision Trees (lecture #5)
Linear & Logistic Regression (lecture #4)
Probabilistic Methods (lecture #3)
Neural Networks
  2-layer → logistic regression
  Multi-layer → difficult to scale up
Classification Rule Induction (in a few slides)

December, 2008 © 2008, Jaime G. Carbonell 13

Linear Discriminator Functions

Two-class problem: y ∈ {class 1, class 2}

[Figure: training points of the two classes plotted in the (x1, x2) feature plane]

December, 2008 © 2008, Jaime G. Carbonell 14

Linear Discriminator Functions

Two-class problem: y ∈ {class 1, class 2}

[Figure: the same two-class training data in the (x1, x2) feature plane]

December, 2008 © 2008, Jaime G. Carbonell 15

Linear Discriminator Functions

Two-class problem: y ∈ {class 1, class 2}

[Figure: the two classes in the (x1, x2) plane separated by a linear discriminator]

  y = a_0 + \sum_{i=1}^{n} a_i x_i
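A minimal sketch (not from the slides) of how such a discriminator is applied: evaluate y = a_0 + Σ a_i x_i for a point and classify by its sign. The weights below are illustrative, not learned.

    # Illustrative sketch: classify a point by the sign of a linear discriminator.
    def linear_discriminant(x, a):
        # a[0] is the bias a0; a[1:] are the per-attribute weights a_i
        return a[0] + sum(ai * xi for ai, xi in zip(a[1:], x))

    def classify(x, a):
        return "class 1" if linear_discriminant(x, a) >= 0 else "class 2"

    a = [-1.0, 0.8, 0.5]            # assumed weights for a 2-D (x1, x2) example
    print(classify([2.0, 1.0], a))  # discriminant = 1.1 >= 0, so "class 1"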

December, 2008 © 2008, Jaime G. Carbonell 16

Linear Discriminator Functions

Two class problem:

y={ , }

x1

x2

new

i

n

ii xay

0

December, 2008 © 2008, Jaime G. Carbonell 17

Issues with Linear Discriminators

What is the "best" placement of the discriminator?
  Maximize the margin
  In general → Support Vector Machines
What if there are K classes (K > 2)?
  Must learn K different discriminators
  Each discriminates class k_i vs. all other classes k_{j≠i}
What if the classes are not linearly separable?
  Minimal-error (L1 or L2) placement (regression)
  Give up on linear discriminators (→ other f_k's)

December, 2008 © 2008, Jaime G. Carbonell 18

Maximizing the Margin

Two-class problem: y ∈ {class 1, class 2}

[Figure: maximum-margin linear separator; the margin is the distance from the separator to the closest training points of either class]

December, 2008 © 2008, Jaime G. Carbonell 19

Nearly-Separable Classes

Two-class problem: y ∈ {class 1, class 2}

[Figure: two classes that are not perfectly linearly separable in the (x1, x2) plane]

December, 2008 © 2008, Jaime G. Carbonell 20

Nearly-Separable Classes

Two-class problem: y ∈ {class 1, class 2}

[Figure: the same nearly-separable classes with a linear separator that misclassifies a few points]

December, 2008 © 2008, Jaime G. Carbonell 21

Minimizing Training Error

Optimal placement of the maximum-margin separator
  Quadratic programming (Support Vector Machines)
  Slack variables to accommodate training errors
Minimizing error metrics (formulas below)
  Number of errors
  Magnitude of error
  Squared error
  Chebyshev norm

  L_0(f, X, y) = \frac{1}{n} \sum_{i=1..n} I(f(x_i) \neq y_i)                              (number of errors)

  L_1(f, X, y) = \sum_{i=1..n} |f(x_i) - y_i| \cdot I(f(x_i) \neq y_i)                     (magnitude of error)

  L_2(f, X, y) = \frac{1}{n} \sum_{i=1..n} (f(x_i) - y_i)^2 \cdot I(f(x_i) \neq y_i)       (squared error)

  L_\infty(f, X, y) = \max_{i=1..n} |f(x_i) - y_i| \cdot I(f(x_i) \neq y_i)                (Chebyshev norm)
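A small sketch (assumptions: predictions and true values are numeric lists of equal length; the function name is mine) computing the four error metrics above:

    # Sketch: the four training-error metrics over paired predictions and true values.
    def error_metrics(preds, ys):
        n = len(ys)
        wrong = [(p, y) for p, y in zip(preds, ys) if p != y]
        L0 = len(wrong) / n                                       # fraction of errors
        L1 = sum(abs(p - y) for p, y in wrong)                    # magnitude of error
        L2 = sum((p - y) ** 2 for p, y in wrong) / n              # squared error
        Linf = max((abs(p - y) for p, y in wrong), default=0.0)   # Chebyshev norm
        return L0, L1, L2, Linf

    print(error_metrics([1.0, 2.5, 3.0], [1.0, 2.0, 4.0]))  # approx (0.67, 1.5, 0.42, 1.0)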

December, 2008 © 2008, Jaime G. Carbonell 22

Symbolic Rule Induction

General idea
  Labeled instances are DB tuples
  Rules are generalized tuples
  Generalization occurs at terms in tuples
  Generalize on new E+ not correctly predicted
  Specialize on new E- not correctly predicted
  Ignore E+ or E- that are already predicted correctly (error-driven learning)

December, 2008 © 2008, Jaime G. Carbonell 23

Symbolic Rule Induction (2)

Example term generalizations
  Constant => disjunction
    e.g. if small portion of value set seen
  Constant => least-common-generalizer class
    e.g. if large portion of value set seen
  Number (or ordinal) => range
    e.g. if dense sequential sampling

Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F        99   -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F        98   +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F        98   +       .00     USA  rash    *none*
81   M        98   -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep

14   F       101   +       .33     USA  normal  ?
67   M       102   +       .77     BRA  rash    ?

Symbolic Rule Induction Example (2)

Candidate Rules:

IF    age    = [12, 65]
      gender = *any*
      temp   = [100, 103]
      b-cult = +
      c-cult = [.00, .23]
      loc    = *any*
      skin   = (normal, flush)
THEN: strep

IF    age    = (15, 65)
      gender = *any*
      temp   = [101, 102]
      b-cult = *any*
      c-cult = [.66, .78]
      loc    = BRA
      skin   = rash
THEN: dengue

Disclaimer: These are not real medical records or rules

December, 2008 © 2008, Jaime G. Carbonell 26

Types of Data Mining

"Supervised" Methods (this DM course)
  Training data has both predictor attributes & objective (to-be-predicted) attributes
  Predict discrete classes → classification
  Predict continuous values → regression
  Duality: classification ↔ regression

"Unsupervised" Methods
  Training data without objective attributes
  Goal: find novel & interesting patterns
  Cutting-edge research, fewer success stories
  Semi-supervised methods: market-basket, ...

December, 2008 © 2008, Jaime G. Carbonell 27

Machine Learning Application Process in a Nutshell

Choose a problem where
  Prediction is valuable and non-trivial
  Sufficient historical data is available
  The objective is measurable (incl. in past data)
Prepare the data
  Tabular form, clean, divide into training & test sets
Select a Machine Learning algorithm
  Human-readable decision fn → rules, trees, ...
  Robust with noisy data → kNN, logistic regression, ...

December, 2008 © 2008, Jaime G. Carbonell 28

Machine Learning Application Process in a Nutshell (2)

Train the ML Algorithm on the Training Data Set
  Each ML method has a different training process
  Training uses both predictor & objective attributes
Run the trained ML Algorithm on the Test Data Set
  Testing uses only predictor attributes & outputs predictions on the objective attributes
  Compare predictions vs. actual objective attributes (see lecture 2 for evaluation metrics)
If Accuracy >= threshold, done.
  Else, try a different ML algorithm, different parameter settings, get more training data, ...

December, 2008 © 2008, Jaime G. Carbonell 29

Sample DB Table (same)   [predictor attributes]                   [objective]

Acct.   Income   Job    Tot Num   Max Num   Owns    Credit   Good
numb.   in K/yr  Now?   Delinq    Delinq    home?   years    cust.?
                        accts     cycles
--------------------------------------------------------------------
1001      85      Y        1         1        N        2       Y
1002      60      Y        3         2        Y        5       N
1003       ?      N        0         0        N        2       N
1004      95      Y        1         2        N        9       Y
1005     100      Y        1         6        Y        3       Y
1006      29      Y        2         1        Y        1       N
1007      88      Y        6         4        Y        8       N
1008      80      Y        0         0        Y        0       Y
1009      31      Y        1         1        N        1       Y
1011       ?      Y        ?         0        ?        7       Y
1012      75      ?        2         4        N        2       N
1013      20      N        1         1        N        3       N
1014      65      Y        1         3        Y        1       Y
1015      65      N        1         2        N        8       Y
1016      20      N        0         0        N        0       N
1017      75      Y        1         3        N        2       N
1018      40      N        0         0        Y       10       Y

December, 2008 © 2008, Jaime G. Carbonell 30

Feature Vector Representation

Predictor-attribute rows in DB tables can be represented as vectors. For instance, the 2nd & 4th rows of predictor attributes in our DB table are:

R2 = [60 Y 3 2 Y 5]

R4 = [95 Y 1 2 N 9]

Converting to numbers (Y = 1, N = 0), we get:

R2 = [60 1 3 2 1 5]

R4 = [95 1 1 2 0 9]

December, 2008 © 2008, Jaime G. Carbonell 31

Vector Similarity

Suppose we have a new credit applicant:

R-new = [65 1 1 2 0 10]

To which of R2 or R4 is she closer?

R2 = [60 1 3 2 1 5]

R4 = [95 1 1 2 0 9]

What should we use as a SIMILARITY METRIC?
Should we first NORMALIZE the vectors?
  If not, the largest component will dominate.

December, 2008 © 2008, Jaime G. Carbonell 32

Normalizing Vector Attributes

Linear Normalization (often sufficient)
  Find max & min values for each attribute
  Normalize each attribute by:

    A_{norm} = (A_{actual} - A_{min}) / (A_{max} - A_{min})

  Apply to all vectors (historical + new), normalizing each attribute, e.g. the income attribute of R2:

    A_1(R2) = (60 - 20) / (100 - 20) = 0.5
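A short sketch of this linear (min-max) normalization applied column-wise to a list of numeric vectors (the function name is mine; Y/N attributes are assumed to have been converted to 1/0 first, as on the earlier slide):

    # Sketch: A_norm = (A_actual - A_min) / (A_max - A_min), per attribute (column).
    def minmax_normalize(vectors):
        cols = list(zip(*vectors))
        mins = [min(c) for c in cols]
        maxs = [max(c) for c in cols]
        return [[(v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, lo, hi in zip(row, mins, maxs)]
                for row in vectors]

    # The slide's income example: (60 - 20) / (100 - 20) = 0.5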

December, 2008 © 2008, Jaime G. Carbonell 33

Normalizing Full Vectors

Normalizing the new applicant vector:
  R-new = [65 1 1 2 0 10]  →  [.56 1 .17 .33 0 1]
And normalizing the two past customer vectors:
  R2 = [60 1 3 2 1 5]  →  [.50 1 .50 .33 1 .50]
  R4 = [95 1 1 2 0 9]  →  [.94 1 .17 .33 0 .90]

What if some attributes are known to be more important, say salary (A1) & delinquencies (A3)?
  Weight accordingly, e.g. ×2 for each
  E.g., R-new-weighted: [1.12 1 .34 .33 0 1]

December, 2008 © 2008, Jaime G. Carbonell 34

Similarity Functions (inverse distance)

Now that we have weighted, normalized vectors, how do we tell exactly their degree of similarity?

Inverse sum of differences (L1):

  sim_{inv-diff}(a, b) = 1 \;/\; \sum_{i=1..n} |a_i - b_i|

Inverse Euclidean distance (L2):

  sim_{Euclid}(a, b) = 1 \;/\; \sqrt{ \sum_{i=1..n} (a_i - b_i)^2 }

December, 2008 © 2008, Jaime G. Carbonell 35

Similarity Functions (direct)

Dot-Product Similarity:

  sim_{dot}(a, b) = a \cdot b = \sum_{i=1..n} a_i b_i

Cosine Similarity (dot product of unit vectors):

  sim_{cos}(a, b) = \frac{\sum_{i=1..n} a_i b_i}{\sqrt{\sum_{i=1..n} a_i^2} \; \sqrt{\sum_{i=1..n} b_i^2}}

December, 2008 © 2008, Jaime G. Carbonell 36

Alternative: Similarity Matrix for Non-Numeric Attributes

          tiny   little  small  medium  large  huge
  tiny    1.0    0.8     0.7    0.5     0.2    0.0
  little         1.0     0.9    0.7     0.3    0.1
  small                  1.0    0.7     0.3    0.2
  medium                        1.0     0.5    0.3
  large                                 1.0    0.8
  huge                                         1.0

Diagonal must be 1.0
Monotonicity property must hold
Triangle inequality must hold
Transitive property must hold
Additivity/Compositionality need not hold

December, 2008 © 2008, Jaime G. Carbonell 37

k-Nearest Neighbors Method

No explicit "training" phase
When a new case arrives (a vector of predictor attributes):
  Find the nearest k neighbors (max similarity) among previous cases (row vectors in the DB table)
  The k neighbors vote for the objective attribute
    Unweighted majority vote, or
    Similarity-weighted vote
Works for both discrete and continuous objective attributes

December, 2008 © 2008, Jaime G. Carbonell 38

Similarity-Weighted Voting in kNN

If the Objective Attribute is Discrete:

  Value(y) = \arg\max_{C \in \mathrm{ValueRange}(obj)} \; \sum_{x_j \in kNN(y) \,\wedge\, value_{obj}(x_j) = C} sim(x_j, y)

If the Objective Attribute is Continuous:

  Value_{obj}(y) = \frac{\sum_{x_j \in kNN(y)} sim(x_j, y) \cdot value_{obj}(x_j)}{\sum_{x_j \in kNN(y)} sim(x_j, y)}
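A compact sketch of kNN with similarity-weighted voting for a discrete objective attribute (names are mine; `sim` can be any of the similarity functions sketched earlier):

    from collections import defaultdict

    # cases: list of (vector, label) pairs from the historical DB table.
    def knn_predict(x_new, cases, sim, k=3):
        neighbors = sorted(cases, key=lambda c: sim(x_new, c[0]), reverse=True)[:k]
        votes = defaultdict(float)
        for vec, label in neighbors:
            votes[label] += sim(x_new, vec)      # similarity-weighted vote
        return max(votes, key=votes.get)

For a continuous objective attribute, return the similarity-weighted average of the neighbors' values instead (sum of sim × value divided by sum of sim), per the second formula above.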

December, 2008 © 2008, Jaime G. Carbonell 39

Applying kNN to Real Problems 1

How does one choose the vector representation?
  Easy: vector = predictor attributes
  What if attributes are not numerical?
    Convert (e.g. High=2, Med=1, Low=0), or
    Use a similarity function over nominal values, e.g. equality or edit distance on strings
How does one choose a distance function?
  Hard: no magic recipe; try simpler ones first
  This implies a need for systematic testing (discussed in coming slides)

December, 2008 © 2008, Jaime G. Carbonell 40

Applying kNN to Real Problems 2

How does one determine whether data should be normalized?
  Normalization is usually a good idea
  One can try kNN both ways to make sure

How does one determine "k" in kNN?
  k is often determined empirically
  A good start is:

    k \approx \log_2(\mathrm{size}(DB))

December, 2008 © 2008, Jaime G. Carbonell 41

Evaluating Machine Learning

Accuracy = Correct-Predictions / Total-Predictions
  Simplest & most popular metric
  But misleading on very-rare-event prediction
Precision, recall & F1
  Borrowed from Information Retrieval
  Applicable to very-rare-event prediction
Correlation (between predicted & actual values)
  For continuous objective attributes
  R^2, kappa coefficient, ...

December, 2008 © 2008, Jaime G. Carbonell 42

Sample Confusion Matrix

(rows = Predicted Diagnoses, columns = True Diagnoses)

                     Shorted     Loose       Burnt      Not
                     Power Sup   Connect's   Resistor   plugged in
  Shorted Power Sup     50          0          10          0
  Loose Connect's        1        120           0         12
  Burnt Resistor        12          0          60          0
  Not plugged in         0          8           5        110

December, 2008 © 2008, Jaime G. Carbonell 43

Measuring Accuracy

Accuracy = correct / total
Error = incorrect / total
Hence: accuracy = 1 - error

  A = \mathrm{Trace}(C) / \mathrm{Full}(C) = \sum_{i=1..n} c_{i,i} \;/\; \sum_{i=1..n} \sum_{j=1..n} c_{i,j}

For the diagnosis example: A = 340/386 = 0.88, E = 1 - A = 0.12

December, 2008 © 2008, Jaime G. Carbonell 44

What About Rare Events?

(rows = Predicted Diagnoses, columns = True Diagnoses)

                     Shorted     Loose       Burnt      Not
                     Power Sup   Connect's   Resistor   plugged in
  Shorted Power Sup      0          0          10          0
  Loose Connect's        1        120           0         12
  Burnt Resistor        12          0          60          0
  Not plugged in         0          8           5        160

December, 2008 © 2008, Jaime G. Carbonell 45

Rare Event Evaluation

Accuracy for this example = 0.88
  ...but NO correct predictions for "shorted power supply", 1 of the 4 diagnoses
Alternative: per-diagnosis (per-class) accuracy:

  A(class_i) = c_{i,i} \;/\; \left( \sum_{j=1..n} c_{i,j} + \sum_{j=1..n} c_{j,i} - c_{i,i} \right)

  A("shorted PS") = 0/22 = 0
  A("not plugged in") = 160/184 = 0.87
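A sketch (function names are mine) computing overall accuracy and the per-class accuracy above from a confusion matrix C laid out as predicted rows by true columns:

    # C[i][j] = count of cases predicted as class i whose true class is j.
    def accuracy(C):
        total = sum(sum(row) for row in C)
        return sum(C[i][i] for i in range(len(C))) / total       # trace / full sum

    def per_class_accuracy(C, i):
        row = sum(C[i])                              # everything predicted as class i
        col = sum(C[j][i] for j in range(len(C)))    # everything truly class i
        return C[i][i] / (row + col - C[i][i])

    C = [[0, 0, 10, 0], [1, 120, 0, 12], [12, 0, 60, 0], [0, 8, 5, 160]]
    print(accuracy(C), per_class_accuracy(C, 0))     # shorted-power-supply class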

December, 2008 © 2008, Jaime G. Carbonell 46

ROC Curves (ROC=Receiver Operating Characteristic)

December, 2008 © 2008, Jaime G. Carbonell 47

ROC Curves (ROC=Receiver Operating Characteristic)

Sensitivity = TP/(TP+FN)

Specificity = TN/(TN+FP)
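A tiny sketch of these two rates from binary counts (an ROC curve plots sensitivity against 1 - specificity as the classifier's decision threshold is varied):

    def sensitivity(tp, fn):          # true-positive rate (recall)
        return tp / (tp + fn)

    def specificity(tn, fp):          # true-negative rate
        return tn / (tn + fp)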

December, 2008 © 2008, Jaime G. Carbonell 48

If Plenty of Data Is Available, Evaluate with a Holdout Set

[Figure: the data is split into a training portion and a held-out evaluation portion; train on one, measure error on the other]

Often also used for parameter optimization

December, 2008 © 2008, Jaime G. Carbonell 49

Finite Cross-Validation Set

True error (true risk):

  e_D = \int_D \delta(f(x) \neq y) \, p(x, y) \, dx \, dy

Test error (empirical risk):

  \hat{e}_S = \frac{1}{m} \sum_{(x,y) \in S} \delta(f(x) \neq y)

where D = all data, S = test data, m = # of test samples

December, 2008 © 2008, Jaime G. Carbonell 50

Confidence Intervals

If S contains m examples, drawn independently, m >= 30
Then, with approximately 95% probability, the true error e_D lies in the interval

  \hat{e}_S \pm 1.96 \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m }

December, 2008 © 2008, Jaime G. Carbonell 51

Example: A hypothesis misclassifies 12 out of 40 examples in the cross-validation set S.
Q: What will the "true" error be on future examples?
A: With 95% confidence, the true error will be in the interval [0.16; 0.44]:

  \hat{e}_S = 12/40 = 0.3, \qquad m = 40

  1.96 \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m } = 0.14

  \hat{e}_S \pm 1.96 \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m } = [0.16; 0.44]
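A sketch of this interval computation (function name is mine):

    import math

    # 95% interval: e_S +/- 1.96 * sqrt(e_S * (1 - e_S) / m)
    def error_confidence_interval(num_errors, m, z=1.96):
        e_s = num_errors / m
        half = z * math.sqrt(e_s * (1 - e_s) / m)
        return e_s - half, e_s + half

    print(error_confidence_interval(12, 40))   # approximately (0.16, 0.44)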

Confidence Intervals

If S contains m examples, drawn independently, m >= 30
Then, with approximately N% probability, the true error e_D lies in the interval

  \hat{e}_S \pm z_N \sqrt{ \hat{e}_S (1 - \hat{e}_S) / m }

  N%:    50%    68%    80%    90%    95%    98%    99%
  z_N:   0.67   1.00   1.28   1.64   1.96   2.33   2.58

December, 2008 © 2008, Jaime G. Carbonell 53

Finite Cross-Validation Set

True error:

  e_D = \int_D \delta(f(x) \neq y) \, p(x, y) \, dx \, dy

Test error:

  \hat{e}_S = \frac{1}{m} \sum_{(x,y) \in S} \delta(f(x) \neq y)

The number of test errors is binomially distributed:

  P\left( \sum_{(x,y) \in S} \delta(f(x) \neq y) = k \right) = \frac{m!}{k!\,(m-k)!} \, e_D^{\,k} (1 - e_D)^{m-k}

k-fold Cross Validation

[Figure: the data is split k ways; in each round the model is trained on k-1 folds (yellow) and evaluated on the remaining fold (pink), yielding error_1, ..., error_k]

  error = \frac{1}{k} \sum_i error_i

December, 2008 © 2008, Jaime G. Carbonell 55

Cross Validation Procedure

Purpose: evaluate DM accuracy on training data
Experiment: try different similarity functions, etc.
Process:
  Divide the training data into k equal pieces (each piece is called a "fold")
  Train the classifier using all but the k-th fold
  Test for accuracy on the k-th fold
  Repeat with the (k-1)-th fold held out for testing, then the (k-2)-th fold, until tested on all folds
  Report the average accuracy across folds
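A sketch of the procedure (assumptions: `train` fits a model on a list of rows and `accuracy` scores a model on held-out rows; both are placeholders, not defined here):

    def k_fold_cv(rows, k, train, accuracy):
        folds = [rows[i::k] for i in range(k)]     # k (roughly) equal pieces
        scores = []
        for i in range(k):
            held_out = folds[i]
            training = [r for j, fold in enumerate(folds) if j != i for r in fold]
            model = train(training)
            scores.append(accuracy(model, held_out))
        return sum(scores) / k                     # average accuracy across folds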

The Jackknife

[Figure: jackknife evaluation splits of the data, holding out one part of the data at a time]

December, 2008 © 2008, Jaime G. Carbonell 57

Comparing Different Hypotheses: Paired t Test

True difference:

  d = e_D(h_1) - e_D(h_2)

For each partition k (test error for partition k):

  \hat{d}_k = \hat{e}_{S,k}(h_1) - \hat{e}_{S,k}(h_2)

Average:

  \hat{d} = \frac{1}{k} \sum_{i=1..k} \hat{d}_i

N% confidence interval:

  \hat{d} \pm t_{N,k-1} \sqrt{ \frac{1}{k(k-1)} \sum_{i=1..k} (\hat{d}_i - \hat{d})^2 }

where k-1 is the degrees of freedom and N is the confidence level
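A sketch of the interval above (assumption: d_hats holds the per-fold error differences between the two hypotheses, and t_n is the t-table value for the chosen confidence level with k-1 degrees of freedom, e.g. about 2.26 for 95% with k = 10):

    import math

    def paired_t_interval(d_hats, t_n):
        k = len(d_hats)
        d_bar = sum(d_hats) / k
        s = math.sqrt(sum((d - d_bar) ** 2 for d in d_hats) / (k * (k - 1)))
        return d_bar - t_n * s, d_bar + t_n * s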

December, 2008 © 2008, Jaime G. Carbonell 58

Version Spaces (Mitchell, 1980)

[Figure: version-space lattice of hypotheses, from Specific Instances (bounded below by the S boundary) up to Anything (bounded above by the G boundary); the "target" concept lies between the S and G boundaries]

December, 2008 © 2008, Jaime G. Carbonell 59

Original & Seeded Version Spaces Version-spaces (Mitchell, 1980)

Symbolic multivariate learning S & G sets define lattice boundaries Exponential worst-case: O(bN)

Seeded Version Spaces (Carbonell, 2002)

Generality level hypothesis seed S & G subsets effective lattice Polynomial worst case: O(bk/2), k=3,4

December, 2008 © 2008, Jaime G. Carbonell 60

Seeded Version Spaces (Carbonell, 2002)

[Figure: seeded version space over translation rules for "The big book" ↔ "el libro grande": constituent patterns X_n ↔ Y_m such as Det Adj N ↔ Det N Adj, with agreement constraints (Y2 num) = (Y3 num), (Y2 gen) = (Y3 gen), (X3 num) = (Y2 num); the "target" concept lies between the S and G boundaries]

December, 2008 © 2008, Jaime G. Carbonell 61

Seeded Version Spaces

[Figure: the seed (best-guess hypothesis) is placed between the S and G boundaries near the "target" concept for the same "The big book" ↔ "el libro grande" example, so only a lattice of depth k (rather than N) need be searched]

December, 2008 © 2008, Jaime G. Carbonell 62

Naïve Bayes Classification

Some Notation:

  Training instance index i = 1, 2, ..., I
  Term index j = 1, 2, ..., J
  Category index k = 1, 2, ..., K
  Training data D^(k) = ((x_i, y_i^(k)))
  Instance feature vector x_i = (1, n_{i1}, n_{i2}, ..., n_{iJ})
  Output labels y_i = (y_i^(1), y_i^(2), ..., y_i^(K)), with y_i^(k) = 1 or 0

December, 2008 © 2008, Jaime G. Carbonell 63

Bayes Classifier

Assigning the most probable category to x:

  \hat{c} = \arg\max_k P(c_k \mid x)
          = \arg\max_k P(c_k) P(x \mid c_k) / P(x)                      (Bayes Rule)
          = \arg\max_k P(c_k) P(x \mid c_k)
          = \arg\max_k [ \log P(c_k) + \log P(x \mid c_k) ]

  \hat{P}(c_k) = (\# of training instances in c_k) / I                  (MLE)

  \hat{P}(x_i \mid c_k) = \hat{P}(n_{i1}, \ldots, n_{iJ} \mid c_k) = ?   (multinomial distribution, next slides)

December, 2008 © 2008, Jaime G. Carbonell 64

Maximum Likelihood Estimate (MLE)

n: # of objects in a random sample from a population
m: # of instances of a category among the n-object sample
p: true probability of any object belonging to the category

Likelihood of observing the data given model p (assuming i.i.d. draws Y_i ~ Ber(p), Y_i ∈ {0, 1}):

  L(D \mid p) = P(D \mid p) = P(Y_1, \ldots, Y_n \mid p) = \prod_i P(Y_i \mid p) = p^m (1 - p)^{n - m}

  f(p) = \log P(D \mid p) = m \log p + (n - m) \log(1 - p)

Setting the derivative of f(p) to zero yields:

  \frac{d}{dp} f(p) = \frac{m}{p} - \frac{n - m}{1 - p} = 0
  \;\Rightarrow\; m (1 - p) = (n - m) p \;\Rightarrow\; p = m / n

December, 2008 © 2008, Jaime G. Carbonell 65

Binomial Distribution

Consider coin toss as a Bernoulli process, X ~ Ber(p)

  P(Head) = p, \qquad P(Tail) = 1 - p = q

What is the probability of seeing 2 heads out of 5 tosses?

  P(\# \text{ of heads} = 2 \mid n = 5) = \binom{5}{2} p^2 q^3 = \frac{5!}{2!\,3!} p^2 q^3

Observing k heads in n tosses follows a binomial distribution:

  Y = \sum_i X_i, \qquad Y \sim Bin(n, p), \qquad P(Y = k) = \binom{n}{k} p^k (1 - p)^{n - k}

December, 2008 © 2008, Jaime G. Carbonell 66

Multinomial Distribution

Consider tossing a 6-faced die n times with probabilities p_1, p_2, ..., p_6, where the probabilities sum to 1.

  P(X_1 = n_1, \ldots, X_6 = n_6) = \frac{n!}{n_1! \cdots n_6!} \, p_1^{n_1} \cdots p_6^{n_6}

Taking the count of each observed face as a random variable, we have a multinomial process defined as

  (X_1, X_2, \ldots, X_6) \sim Mul(n, p_1, \ldots, p_6), \qquad X_j \geq 0, \qquad \sum_{j=1}^{6} X_j = n

December, 2008 © 2008, Jaime G. Carbonell 67

Multinomial NB

The conditional probability is

  P(x \mid c) = P(n_{x1}, \ldots, n_{xJ} \mid c) = \frac{n_x!}{n_{x1}!\, n_{x2}! \cdots n_{xJ}!} \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}}        (t_j is a term)

We can remove the first term (the multinomial coefficient, which does not depend on c) from the objective function:

  P(x \mid c) \propto \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}}, \qquad \log P(x \mid c) = \sum_{j=1}^{J} n_{xj} \log P(t_j \mid c) + \mathrm{const}
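A compact sketch of a multinomial NB text classifier with Laplace smoothing (see the next slide); documents are lists of terms paired with a category, and all names are mine:

    import math
    from collections import Counter, defaultdict

    def train_nb(docs):                         # docs: list of (terms, category)
        class_counts = Counter(c for _, c in docs)
        term_counts = defaultdict(Counter)
        for terms, c in docs:
            term_counts[c].update(terms)
        vocab = {t for terms, _ in docs for t in terms}
        return class_counts, term_counts, vocab, len(docs)

    def classify_nb(terms, model):
        class_counts, term_counts, vocab, n_docs = model
        best, best_score = None, float("-inf")
        for c, n_c in class_counts.items():
            score = math.log(n_c / n_docs)                            # log prior
            total = sum(term_counts[c].values())
            for t in terms:                                           # log P(t|c) per occurrence
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = c, score
        return best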

December, 2008 © 2008, Jaime G. Carbonell 68

Smoothing Methods

Laplace Smoothing (common):

  \tilde{P}(t \mid c) = (1 + n_{t|c}) \;/\; \left( |V| + \sum_{t' \in V} n_{t'|c} \right)

Two-state Hidden Markov Model (BBN, or Jelinek-Mercer Interpolation):

  \tilde{P}(t \mid c) = \lambda P(t \mid c) + (1 - \lambda) P(t)

Hierarchical Smoothing (McCallum, ICML'98):

  \tilde{P}(t \mid c) = \lambda_1 P(t \mid c_1) + \lambda_2 P(t \mid c_2) + \cdots + \lambda_h P(t \mid c_h)

The lambdas (summing to 1) are the mixture weights, obtained by running an EM algorithm on a validation set.

December, 2008 © 2008, Jaime G. Carbonell 69

Basic Assumptions

Term independence:

  P(x_i \mid c_k) = P(t_1 \mid c_k)^{n_{i1}} \cdot P(t_2 \mid c_k)^{n_{i2}} \cdots

Expecting one objective attribute y per instance:

  \sum_k P(c_k) = 1

  \arg\max_k P(d \mid c_k) = \arg\max_k P(n_{d,1}, n_{d,2}, \ldots, n_{d,|V|} \mid c_k)
                           = \arg\max_k \frac{n_d!}{n_{d,1}! \cdots n_{d,|V|}!} \prod_{t \in V} P(t \mid c_k)^{n_{d,t}}

Continuity of instances in the same class (one mode per class)

December, 2008 © 2008, Jaime G. Carbonell 70

NB and Cross Entropy

Entropy
  Measures the uncertainty: lower entropy means easier predictions
  Minimum coding length if distribution p is known

    H(p) = -\sum_{k=1}^{K} p_k \log p_k, \qquad \sum_k p_k = 1, \qquad p = (p_1, \ldots, p_K)

Cross Entropy
  Measures the coding length (in # of bits) based on distribution q when the true distribution is p

    H(p, q) = -\sum_k p_k \log q_k
            = -\sum_k p_k \log p_k + \sum_k p_k \log(p_k / q_k)
            = H(p) + D(p \| q)

December, 2008 © 2008, Jaime G. Carbonell 71

Kullback-Leibler (KL) Divergence

Also called "Relative Entropy"
Measures the difference between two distributions:

  D(p \| q) = \sum_k p_k \log(p_k / q_k)

Zero-valued if p = q
Not symmetric: p and q are not interchangeable, i.e. D(p \| q) ≠ D(q \| p) in general


December, 2008 © 2008, Jaime G. Carbonell 72

NB & Cross Entropy (cont’d)

  k^* = \arg\max_k [ \log P(c_k) + \sum_{t_j \in x_i} n_{ij} \log P(t_j \mid c_k) ]
      = \arg\max_k [ \log P(c_k) + n_i \sum_{t_j \in x_i} (n_{ij} / n_i) \log P(t_j \mid c_k) ]
      = \arg\max_k [ \log P(c_k) + n_i \sum_{t_j \in x_i} \hat{P}(t_j \mid x_i) \log P(t_j \mid c_k) ]
      = \arg\max_k [ \log P(c_k) - n_i H(p_{x_i}, q_{c_k}) ]
      = \arg\min_k [ -\log P(c_k) + n_i H(p_{x_i}, q_{c_k}) ]

Minimum Description Length (MDL) Classifier

December, 2008 © 2008, Jaime G. Carbonell 73

Concluding Remarks on NB

Pros
  Explicit probabilistic reasoning
  Relatively effective, fast online response (as an eager learner)

Cons
  The scoring function (logarithm of term probabilities) can be too sensitive to measurement errors on rare features
  The one-class-per-instance assumption imposes both theoretical and practical limitations
  Empirically weak when dealing with rare categories and large feature sets

December, 2008 © 2008, Jaime G. Carbonell 74

Statistical Decision Theory

Random input X in R^J
Random output Y in {1, 2, ..., K}
Prediction f(X) in {1, 2, ..., K}
Loss function (0-1 loss for classification):
  L(y(x), f(x)) = 0 iff f(x) = y(x)
  L(y(x), f(x)) = 1 otherwise

Expected Prediction Error (EPE):

  EPE = E[ L(Y, f(X)) ] = E_X \sum_{k=1}^{K} L(k, f(X)) \, P(k \mid X)

Minimizing EPE pointwise:

  \hat{f}(x) = \arg\min_{f(x) \in \{1,\ldots,K\}} \sum_{k=1}^{K} L(k, f(x)) \, P(k \mid x)
             = \arg\min_{k \in \{1,\ldots,K\}} [ 1 - P(k \mid x) ]
             = \arg\max_{k \in \{1,\ldots,K\}} P(k \mid x)

December, 2008 © 2008, Jaime G. Carbonell 75

Selection of ML Algorithm (I)

Method            Training Data     Random Noise    Scalability
                  Requirements      Tolerance       (atts + data)

Rule Induction    Sparse            None            Good
Decision Trees    Sparse-Dense      Some            Excellent
Naïve Bayes       Medium-Dense      Some-Good       Medium
Regression        Medium-Dense      Some-Good       Good
kNN               Sparse-Dense      Some-Good       Good-Excellent
SVM               Medium-Dense      Some-Good       Good-Excellent
Neural Nets       Dense             Good            Poor-Medium

December, 2008 © 2008, Jaime G. Carbonell 76

Selection of ML Algorithm (II)

Method            Quality of        Explanatory     Popularity
                  Prediction        Power           of Usage

Rule Induction    Good, brittle     Very clear      Med, declining
Decision Trees    Good/category     Very clear      High, stable
Naïve Bayes       Medium/cat        Partial         Med, declining
Regression        Good/both         Partial-Poor    High, stable
kNN               Good/both         Partial-Good    Med, increasing
SVM               Very good/cat     Poor            Med, increasing
Neural Nets       Good/cat          Poor            High, declining