Machine Learning & Data Mining, Part 1: The Basics

Jaime Carbonell (with contributions from Tom Mitchell, Sebastian Thrun and Yiming Yang)
Carnegie Mellon University, [email protected]
December 2008. © 2008, Jaime G. Carbonell
Some Definitions (KBS vs ML)

Knowledge-Based Systems
- Rules, procedures, semantic nets, Horn clauses
- Inference: matching, inheritance, resolution
- Acquisition: manual, from human experts

Machine Learning
- Data: tables, relations, attribute lists, ...
- Inference: rules, trees, decision functions, ...
- Acquisition: automated, from data

Data Mining
- Machine learning applied to large real problems
- May be augmented with KBS
Ingredients for Machine Learning

"Historical" data (e.g. DB tables)
- E.g. products (features, marketing, support, ...)
- E.g. competition (products, pricing, customers)
- E.g. customers (demographics, purchases, ...)

Objective function (to be predicted or optimized)
- E.g. maximize revenue per customer
- E.g. minimize manufacturing defects

Scalable machine learning method(s)
- E.g. decision-tree induction, logistic regression
- E.g. "active" learning, clustering
Sample ML/DM Applications I

Credit Scoring
- Training: past applicant profiles, how much credit given, payback or default
- Input: applicant profile (income, debts, ...)
- Objective: credit score + max credit amount

Fraud Detection (e.g. credit-card transactions)
- Training: past known legitimate & fraudulent transactions
- Input: proposed transaction (loc, cust, $$, ...)
- Objective: approve/block decision
Sample ML/DM Applications II

Demographic Segmentation
- Training: past customer profiles (age, gender, education, income, ...) + product preferences
- Input: new product description (features)
- Objective: predict market-segment affinity

Marketing/Advertisement Effectiveness
- Training: past advertisement campaigns, demographic targets, product categories
- Input: proposed advertisement campaign
- Objective: project effectiveness (sales increase modulated by marketing cost)
Sample ML/DM Applications III

Product (or Part) Reliability
- Training: past products/parts + specs at manufacturing + customer usage + maintenance records
- Input: new part + expected usage
- Objective: mean time to failure (replacement)

Manufacturing Tolerances
- Training: past product/part manufacturing process, tolerances, inspections, ...
- Input: new part + expected usage
- Objective: optimal manufacturing precision (minimize costs of failure + manufacture)
Sample ML/DM Applications IV

Mechanical Diagnosis
- Training: past observed symptoms at (or prior to) breakdown + underlying cause
- Input: current symptoms
- Objective: predict cause of failure

Mechanical Repair
- Training: cause of failure + product usage + repair (or PM) effectiveness
- Input: new failure cause + product usage
- Objective: recommended repair (or preventive-maintenance operation)
Sample ML/DM Applications V

Billeting (job assignments)
- Training: employee profiles, position profiles, employee performance in assigned position
- Input: new employee or new position profile
- Objective: predict performance in position

Text Mining & Routing (e.g. customer centers)
- Training: electronic problem reports, customer requests + who should handle them
- Input: new incoming texts
- Objective: assign category + route or reply
Preparing Historical Data

Extract a DB table with all the needed information
- Select, join, project, aggregate, ...
- Filter out rows with significant missing data

Determine predictor attributes (columns)
- Ask a domain expert for relevant attributes, or
- Start with all attributes and automatically sub-select the most predictive ones (feature selection)

Determine the to-be-predicted attribute (column)
- The objective of the DM (number, decision, ...)
Sample DB Table

                    [----------- predictor attributes -----------]  [objective]
Acct.  Income   Job   Tot Delinq  Max Delinq  Owns   Num Credit  Good
numb.  in K/yr  Now?  accts       cycles      home?  years       cust.?
-----------------------------------------------------------------------
1001   85       Y     1           1           N      2           Y
1002   60       Y     3           2           Y      5           N
1003   ?        N     0           0           N      2           N
1004   95       Y     1           2           N      9           Y
1005   100      Y     1           6           Y      3           Y
1006   29       Y     2           1           Y      1           N
1007   88       Y     6           4           Y      8           N
1008   80       Y     0           0           Y      0           Y
1009   31       Y     1           1           N      1           Y
1011   ?        Y     ?           0           ?      7           Y
1012   75       ?     2           4           N      2           N
1013   20       N     1           1           N      3           N
1014   65       Y     1           3           Y      1           Y
1015   65       N     1           2           N      8           Y
1016   20       N     0           0           N      0           N
1017   75       Y     1           3           N      2           N
1018   40       N     0           0           Y      10          Y
Supervised Learning on DB Table

Given: DB table
- With identified predictor attributes x_1, x_2, ...
- And objective attribute y

Find: a prediction function

$$F_k : \langle x_1, \dots, x_n \rangle \to y, \qquad F_k \in \{F_1, F_2, \dots, F_m\}$$

Subject to: error minimization on data table M
- Least-squares error, or L1 norm, or L-infinity norm, ...

$$f_{best} = \arg\min_{f_k \in \{f_1, \dots, f_m\}} \sum_{i \in Rows(M)} \left[ y_i - f_k(x_i) \right]^2$$
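As a concrete illustration of this objective, here is a minimal Python sketch (not from the slides; the candidate functions and table rows are made up for the example) that scores a set of candidate prediction functions and keeps the one with the lowest squared error:

    # Sketch: choose f_best by least-squares error over table rows.
    # Candidate functions and data are illustrative, not from the slides.
    rows = [([60, 1], 5), ([95, 1], 9), ([20, 0], 3)]  # (predictor vector, objective y)

    candidates = {
        "tenth-of-income": lambda x: x[0] / 10.0,
        "income-plus-job": lambda x: x[0] / 12.0 + x[1],
    }

    def squared_error(f, rows):
        """Sum of [y_i - f(x_i)]^2 over all rows of the table."""
        return sum((y - f(x)) ** 2 for x, y in rows)

    f_best = min(candidates, key=lambda name: squared_error(candidates[name], rows))
    print(f_best, squared_error(candidates[f_best], rows))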
Popular Predictor Functions

- Linear discriminators (next slides)
- k-Nearest-Neighbors (lecture #2)
- Decision trees (lecture #5)
- Linear & logistic regression (lecture #4)
- Probabilistic methods (lecture #3)
- Neural networks
  - 2-layer -> logistic regression
  - Multi-layer: difficult to scale up
- Classification rule induction (in a few slides)
Linear Discriminator Functions

[Figure, built up over four slides: a two-class problem, with the training points of each class plotted in the (x1, x2) plane. A linear discriminator

$$y = \sum_{i=0}^{n} a_i x_i$$

separates the two classes, and a new point is classified by the side of the line on which it falls.]
Issues with Linear Discriminators

What is the "best" placement of the discriminator?
- Maximize the margin
- In general: Support Vector Machines

What if there are k classes (k > 2)?
- Must learn k different discriminators
- Each discriminates k_i vs k_{j != i} (all other classes)

What if the classes are not linearly separable?
- Minimal-error (L1 or L2) placement (regression)
- Give up on linear discriminators (-> other f_k's)
Maximizing the Margin

[Figure: the same two-class problem in the (x1, x2) plane; the separating line is placed so that the margin, the distance to the closest points of either class, is maximized.]
Nearly-Separable Classes

[Figure, shown over two slides: a two-class problem in the (x1, x2) plane where no line separates the classes perfectly; any linear discriminator misclassifies a few points near the boundary.]
Minimizing Training Error

Optimal placing of the maximum-margin separator
- Quadratic programming (Support Vector Machines)
- Slack variables to accommodate training errors

Minimizing error metrics:

Number of errors:
$$L_0(f, X, y) = \frac{1}{n} \sum_{i=1..n} I(f(x_i) \neq y_i)$$

Magnitude of error:
$$L_1(f, X, y) = \frac{1}{n} \sum_{i=1..n} \left| f(x_i) - y_i \right| \, I(f(x_i) \neq y_i)$$

Squared error:
$$L_2(f, X, y) = \frac{1}{n} \sum_{i=1..n} \left( f(x_i) - y_i \right)^2 \, I(f(x_i) \neq y_i)$$

Chebyshev (L-infinity) norm:
$$L_\infty(f, X, y) = \max_{i=1..n} \left| f(x_i) - y_i \right| \, I(f(x_i) \neq y_i)$$
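The four metrics above translate directly into code. A minimal sketch (illustrative, not from the slides; f is any prediction function, xs/ys are the table's predictor rows and labels):

    def errors(f, xs, ys):
        """(prediction, truth) pairs restricted to misclassified points."""
        return [(f(x), y) for x, y in zip(xs, ys) if f(x) != y]

    def L0(f, xs, ys):                  # fraction of errors
        return len(errors(f, xs, ys)) / len(xs)

    def L1(f, xs, ys):                  # mean magnitude of error
        return sum(abs(p - y) for p, y in errors(f, xs, ys)) / len(xs)

    def L2(f, xs, ys):                  # mean squared error
        return sum((p - y) ** 2 for p, y in errors(f, xs, ys)) / len(xs)

    def Linf(f, xs, ys):                # Chebyshev norm: worst single error
        errs = errors(f, xs, ys)
        return max(abs(p - y) for p, y in errs) if errs else 0.0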
Symbolic Rule Induction

General idea
- Labeled instances are DB tuples
- Rules are generalized tuples
- Generalization occurs at terms in tuples
- Generalize on new E+ not correctly predicted
- Specialize on new E- not correctly predicted
- Ignore correctly predicted E+ or E- (error-driven learning)
Symbolic Rule Induction (2)

Example term generalizations
- Constant => disjunction
  e.g. if a small portion of the value set has been seen
- Constant => least-common-generalizer class
  e.g. if a large portion of the value set has been seen
- Number (or ordinal) => range
  e.g. if dense sequential sampling
Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F       99    -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F       98    +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F       98    +       .00     USA  rash    *none*
81   M       98    -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep

To be classified:
14   F       101   +       .33     USA  normal  ?
67   M       102   +       .77     BRA  rash    ?
Symbolic Rule Induction Example (2)

Candidate Rules:

IF   age    = [12,65]
     gender = *any*
     temp   = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc    = *any*
     skin   = (normal,flush)
THEN strep

IF   age    = (15,65)
     gender = *any*
     temp   = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc    = BRA
     skin   = rash
THEN dengue

Disclaimer: These are not real medical records or rules.
Types of Data Mining

"Supervised" methods (this DM course)
- Training data has both predictor attributes & objective (to-be-predicted) attributes
- Predict discrete classes -> classification
- Predict continuous values -> regression
- Duality: classification <-> regression

"Unsupervised" methods
- Training data without objective attributes
- Goal: find novel & interesting patterns
- Cutting-edge research, fewer success stories
- Semi-supervised methods: market-basket, ...
Machine Learning Application Process in a Nutshell

Choose a problem where
- Prediction is valuable and non-trivial
- Sufficient historical data is available
- The objective is measurable (incl. in past data)

Prepare the data
- Tabular form; clean; divide into training & test sets

Select a machine learning algorithm
- Human-readable decision function -> rules, trees, ...
- Robust with noisy data -> kNN, logistic regression, ...
Machine Learning Application Process in a Nutshell (2)

Train the ML algorithm on the training data set
- Each ML method has a different training process
- Training uses both predictor & objective attributes

Run the trained ML algorithm on the test data set
- Testing uses only predictor attributes & outputs predictions on the objective attributes
- Compare predictions vs actual objective attributes (see lecture 2 for evaluation metrics)

If accuracy >= threshold, done.
Else, try a different ML algorithm, different parameter settings, more training data, ...
Sample DB Table (same)

                    [----------- predictor attributes -----------]  [objective]
Acct.  Income   Job   Tot Delinq  Max Delinq  Owns   Num Credit  Good
numb.  in K/yr  Now?  accts       cycles      home?  years       cust.?
-----------------------------------------------------------------------
1001   85       Y     1           1           N      2           Y
1002   60       Y     3           2           Y      5           N
1003   ?        N     0           0           N      2           N
1004   95       Y     1           2           N      9           Y
1005   100      Y     1           6           Y      3           Y
1006   29       Y     2           1           Y      1           N
1007   88       Y     6           4           Y      8           N
1008   80       Y     0           0           Y      0           Y
1009   31       Y     1           1           N      1           Y
1011   ?        Y     ?           0           ?      7           Y
1012   75       ?     2           4           N      2           N
1013   20       N     1           1           N      3           N
1014   65       Y     1           3           Y      1           Y
1015   65       N     1           2           N      8           Y
1016   20       N     0           0           N      0           N
1017   75       Y     1           3           N      2           N
1018   40       N     0           0           Y      10          Y
Feature Vector Representation

Predictor-attribute rows in DB tables can be represented as vectors. For instance, the 2nd & 4th rows of predictor attributes in our DB table are:

R2 = [60 Y 3 2 Y 5]
R4 = [95 Y 1 2 N 9]

Converting to numbers (Y = 1, N = 0), we get:

R2 = [60 1 3 2 1 5]
R4 = [95 1 1 2 0 9]
Vector Similarity

Suppose we have a new credit applicant:

R-new = [65 1 1 2 0 10]

To which of R2 or R4 is she closer?

R2 = [60 1 3 2 1 5]
R4 = [95 1 1 2 0 9]

What should we use as a SIMILARITY METRIC?
Should we first NORMALIZE the vectors?
- If not, the largest component will dominate
Normalizing Vector Attributes

Linear normalization (often sufficient)
- Find max & min values for each attribute
- Normalize each attribute by:

$$A_{norm} = \frac{A_{actual} - A_{min}}{A_{max} - A_{min}}$$

- Apply to all vectors (historical + new), normalizing each attribute, e.g.:

$$A_{2,1} = \frac{60 - 20}{100 - 20} = 0.5$$
Normalizing Full Vectors

Normalizing the new applicant vector:
R-new = [65 1 1 2 0 10] -> [.56 1 .17 .33 0 1]

And normalizing the two past customer vectors:
R2 = [60 1 3 2 1 5] -> [.50 1 .50 .33 1 .50]
R4 = [95 1 1 2 0 9] -> [.94 1 .17 .33 0 .90]

How about if some attributes are known to be more important, say salary (A1) & delinquencies (A3)?
- Weight accordingly, e.g. multiply each by 2
- E.g., R-new-weighted: [1.12 1 .34 .33 0 1]
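A minimal Python sketch of this normalization and weighting (the per-column mins/maxes come from the running credit table; treat the code as illustrative):

    # Linear normalization of feature vectors, as in the running credit example.
    # Column mins/maxes would normally be computed over the historical table.
    col_min = [20, 0, 0, 0, 0, 0]     # per-attribute minimums (from the table)
    col_max = [100, 1, 6, 6, 1, 10]   # per-attribute maximums (from the table)

    def normalize(v):
        """Map each attribute into [0, 1] via (a - min) / (max - min)."""
        return [(a - lo) / (hi - lo) for a, lo, hi in zip(v, col_min, col_max)]

    def weight(v, w):
        """Scale attributes by importance weights (e.g. 2x salary & delinquencies)."""
        return [a * wi for a, wi in zip(v, w)]

    r_new = normalize([65, 1, 1, 2, 0, 10])   # -> [0.5625, 1.0, 0.167, 0.333, 0.0, 1.0]
    r_new_weighted = weight(r_new, [2, 1, 2, 1, 1, 1])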
Similarity Functions (inverse distance)

Now that we have weighted, normalized vectors, how do we tell exactly their degree of similarity?

Inverse sum of differences (L1):
$$sim_{inv\text{-}diff}(a, b) = \frac{1}{\sum_{i=1..n} \left| a_i - b_i \right|}$$

Inverse Euclidean distance (L2):
$$sim_{Euclid}(a, b) = \frac{1}{\sqrt{\sum_{i=1..n} (a_i - b_i)^2}}$$
Similarity Functions (direct)

Dot-product similarity:
$$sim_{dot}(a, b) = \sum_{i=1..n} a_i b_i$$

Cosine similarity (dot product of unit vectors):
$$sim_{cos}(a, b) = \frac{\sum_{i=1..n} a_i b_i}{\sqrt{\sum_{i=1..n} a_i^2} \; \sqrt{\sum_{i=1..n} b_i^2}}$$
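The four similarity functions above, sketched in Python (illustrative; the inverse-distance forms are undefined for identical vectors, so a guard is added here):

    import math

    def sim_inv_diff(a, b):
        """Inverse of the L1 distance; infinite for identical vectors."""
        d = sum(abs(ai - bi) for ai, bi in zip(a, b))
        return math.inf if d == 0 else 1.0 / d

    def sim_euclid(a, b):
        """Inverse of the L2 (Euclidean) distance."""
        d = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        return math.inf if d == 0 else 1.0 / d

    def sim_dot(a, b):
        """Raw dot product."""
        return sum(ai * bi for ai, bi in zip(a, b))

    def sim_cos(a, b):
        """Dot product of unit vectors (assumes neither vector is all-zero)."""
        return sim_dot(a, b) / (math.sqrt(sim_dot(a, a)) * math.sqrt(sim_dot(b, b)))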
Alternative: Similarity Matrix for Non-Numeric Attributes

         tiny  little  small  medium  large  huge
tiny     1.0   0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0

- Diagonal must be 1.0
- Monotonicity property must hold
- Triangle inequality must hold
- Transitive property must hold
- Additivity/compositionality need not hold
k-Nearest Neighbors Method

No explicit "training" phase.

When a new case arrives (a vector of predictor attributes):
- Find the nearest k neighbors (max similarity) among previous cases (row vectors in the DB table)
- The k neighbors vote for the objective attribute
  - Unweighted majority vote, or
  - Similarity-weighted vote

Works for both discrete and continuous objective attributes.
Similarity-Weighted Voting in kNN

If the objective attribute is discrete:
$$Value_{obj}(x) = \arg\max_{C \in ValueRange(obj)} \sum_{y_j \in kNN(x)} sim(x, y_j) \cdot I\left[ value_{obj}(y_j) = C \right]$$

If the objective attribute is continuous:
$$Value_{obj}(x) = \frac{\sum_{y_j \in kNN(x)} sim(x, y_j) \cdot value_{obj}(y_j)}{\sum_{y_j \in kNN(x)} sim(x, y_j)}$$
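A compact sketch of kNN with similarity-weighted voting for a discrete objective attribute (sim can be any of the similarity functions above; the names here are illustrative):

    from collections import defaultdict

    def knn_predict(x, cases, k, sim):
        """cases: list of (vector, label). Returns the similarity-weighted
        vote of the k most similar stored cases."""
        neighbors = sorted(cases, key=lambda c: sim(x, c[0]), reverse=True)[:k]
        votes = defaultdict(float)
        for vec, label in neighbors:
            votes[label] += sim(x, vec)   # each neighbor votes with weight sim(x, y_j)
        return max(votes, key=votes.get)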
Applying kNN to Real Problems 1

How does one choose the vector representation?
- Easy: vector = predictor attributes
- What if attributes are not numerical?
  - Convert (e.g. High=2, Med=1, Low=0), or
  - Use a similarity function over nominal values, e.g. equality or edit distance on strings

How does one choose a distance function?
- Hard: no magic recipe; try simpler ones first
- This implies a need for systematic testing (discussed in coming slides)
Applying kNN to Real Problems 2

How does one determine whether data should be normalized?
- Normalization is usually a good idea
- One can try kNN both ways to make sure

How does one determine "k" in kNN?
- k is often determined empirically
- A good start is:
$$k = \log_2(size(DB))$$
Evaluating Machine Learning

Accuracy = Correct-Predictions / Total-Predictions
- Simplest & most popular metric
- But misleading on very-rare-event prediction

Precision, recall & F1
- Borrowed from Information Retrieval
- Applicable to very-rare-event prediction

Correlation (between predicted & actual values)
- For continuous objective attributes
- R^2, kappa coefficient, ...
Sample Confusion Matrix

                            True Diagnoses
Predicted          Shorted    Loose      Burnt     Not
Diagnoses          Power Sup  Connect's  Resistor  plugged in
Shorted Power Sup  50         0          10        0
Loose Connect's    1          120        0         12
Burnt Resistor     12         0          60        0
Not plugged in     0          8          5         110
Measuring Accuracy

Accuracy = correct/total
Error = incorrect/total
Hence: accuracy = 1 - error

$$A = \frac{Trace(C)}{Full(C)} = \frac{\sum_{i=1..n} c_{i,i}}{\sum_{i=1..n} \sum_{j=1..n} c_{i,j}}$$

For the diagnosis example: A = 340/386 = 0.88, E = 1 - A = 0.12
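A small sketch of this computation over a confusion matrix C (rows = predicted, columns = true, as above):

    def accuracy(C):
        """Trace over grand total of a square confusion matrix."""
        trace = sum(C[i][i] for i in range(len(C)))
        total = sum(sum(row) for row in C)
        return trace / total

    C = [[50, 0, 10, 0],
         [1, 120, 0, 12],
         [12, 0, 60, 0],
         [0, 8, 5, 110]]
    print(accuracy(C))   # ~0.88 for the diagnosis example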
What About Rare Events?

                            True Diagnoses
Predicted          Shorted    Loose      Burnt     Not
Diagnoses          Power Sup  Connect's  Resistor  plugged in
Shorted Power Sup  0          0          10        0
Loose Connect's    1          120        0         12
Burnt Resistor     12         0          60        0
Not plugged in     0          8          5         160
Rare Event Evaluation

Accuracy for this example = 0.88
- ...but NO correct predictions for "shorted power supply", 1 of the 4 diagnoses

Alternative: per-diagnosis (per-class) accuracy:

$$A(class_i) = \frac{c_{i,i}}{\sum_{j=1..n} c_{i,j} + \sum_{j=1..n} c_{j,i} - c_{i,i}}$$

A("shorted PS") = 0/22 = 0
A("not plugged in") = 160/184 = 0.87
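A sketch of the per-class accuracy above (row total plus column total, minus the diagonal cell, in the denominator), using the rare-event matrix:

    def per_class_accuracy(C, i):
        """c_ii over (row_i + column_i - c_ii) of confusion matrix C."""
        row = sum(C[i])
        col = sum(C[j][i] for j in range(len(C)))
        return C[i][i] / (row + col - C[i][i])

    C = [[0, 0, 10, 0],
         [1, 120, 0, 12],
         [12, 0, 60, 0],
         [0, 8, 5, 160]]
    print(per_class_accuracy(C, 0))   # "shorted power supply": 0.0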
ROC Curves (ROC = Receiver Operating Characteristic)

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
If Plenty of Data, Evaluate with a Holdout Set

[Diagram: the data is split into a larger training portion and a held-out evaluation portion; the model is trained on the first and its error is measured on the second.]

Often also used for parameter optimization.
Finite Cross-Validation Set

True error (true risk):
$$e_D = \int_D I(f(x) \neq y) \, p(x, y) \, dx \, dy$$

Test error (empirical risk):
$$\hat{e}_S = \frac{1}{m} \sum_{(x,y) \in S} I(f(x) \neq y)$$

where D = all data, S = test data, m = # of test samples.
Confidence Intervals

If S contains m examples, drawn independently, with m >= 30,

then with approximately 95% probability, the true error e_D lies in the interval

$$\hat{e}_S \pm 1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}}$$
Example:

A hypothesis misclassifies 12 out of 40 examples in cross-validation set S.
Q: What will the "true" error be on future examples?
A: With 95% confidence, the true error will lie in the interval

$$\hat{e}_S \pm 1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}} = [0.16; 0.44]$$

since
$$\hat{e}_S = \frac{12}{40} = 0.3, \qquad m = 40, \qquad 1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}} \approx 0.14$$
Confidence Intervals

If S contains m examples, drawn independently, with m >= 30,

then with approximately N% probability, the true error e_D lies in the interval

$$\hat{e}_S \pm z_N \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}}$$

N%:   50%   68%   80%   90%   95%   98%   99%
z_N:  0.67  1.0   1.28  1.64  1.96  2.33  2.58
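A sketch of this interval computation in code (the z-values are the table above):

    import math

    Z = {50: 0.67, 68: 1.0, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

    def confidence_interval(errors, m, level=95):
        """N% confidence interval for the true error given m test samples."""
        e = errors / m
        half = Z[level] * math.sqrt(e * (1 - e) / m)
        return (e - half, e + half)

    print(confidence_interval(12, 40))   # ~(0.16, 0.44), the worked example above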
Finite Cross-Validation Set

True error:
$$e_D = \int_D I(f(x) \neq y) \, p(x, y) \, dx \, dy$$

Test error:
$$\hat{e}_S = \frac{1}{m} \sum_{(x,y) \in S} I(f(x) \neq y)$$

The number of test errors is binomially distributed:
$$p\left( \sum_{(x,y) \in S} I(f(x) \neq y) = k \right) = \frac{m!}{k! \, (m-k)!} \, e_D^{\,k} \, (1 - e_D)^{m-k}$$
k-fold Cross-Validation

[Diagram: the data is split k ways; for each fold i = 1..k, train on the other k-1 parts (yellow) and evaluate on the held-out fold i (pink), yielding error_i.]

$$error = \frac{1}{k} \sum_{i=1..k} error_i$$
Cross-Validation Procedure

Purpose: evaluate DM accuracy on training data
Experiment: try different similarity functions, etc.
Process:
- Divide the training data into k equal pieces (each piece is called a "fold")
- Train the classifier using all but the kth fold
- Test for accuracy on the kth fold
- Repeat with the (k-1)th fold held out for testing, then the (k-2)th fold, until tested on all folds
- Report the average accuracy across folds
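A minimal sketch of this procedure (train_fn and accuracy_fn stand in for any learner and accuracy measure; the names are illustrative):

    def k_fold_cv(data, k, train_fn, accuracy_fn):
        """Split data into k folds; train on k-1, test on the held-out fold,
        and return the average accuracy across all k rotations."""
        folds = [data[i::k] for i in range(k)]   # simple round-robin split
        scores = []
        for i in range(k):
            held_out = folds[i]
            training = [row for j, fold in enumerate(folds) if j != i for row in fold]
            model = train_fn(training)
            scores.append(accuracy_fn(model, held_out))
        return sum(scores) / k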
Comparing Different Hypotheses: Paired t Test

True difference:
$$d = e_D(1) - e_D(2)$$

For each partition k (test error for partition k):
$$\hat{d}_k = \hat{e}_{S,k}(1) - \hat{e}_{S,k}(2)$$

Average:
$$\bar{d} = \frac{1}{k} \sum_{i=1..k} \hat{d}_i$$

N% confidence interval:
$$\bar{d} \pm t_{N, k-1} \sqrt{\frac{1}{k(k-1)} \sum_{i=1..k} \left( \hat{d}_i - \bar{d} \right)^2}$$

where k-1 is the degrees of freedom and N is the confidence level.
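A sketch of this interval given per-fold error pairs; the t quantile is looked up via scipy here (an assumption of this sketch, not something the slides use):

    import math
    from scipy.stats import t   # assumes scipy is available for the t quantile

    def paired_t_interval(errs1, errs2, confidence=0.95):
        """errs1/errs2: per-fold test errors of two hypotheses on the SAME folds.
        Returns the N% confidence interval for the true error difference."""
        k = len(errs1)
        diffs = [a - b for a, b in zip(errs1, errs2)]
        d_bar = sum(diffs) / k
        s = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1)))
        t_val = t.ppf(1 - (1 - confidence) / 2, k - 1)   # two-sided, k-1 dof
        return (d_bar - t_val * s, d_bar + t_val * s)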
Version Spaces (Mitchell, 1980)

[Diagram: a generalization lattice of depth N and branching factor b, spanning from specific instances at the bottom, bounded by the S (most specific) boundary, up to "anything" at the top, bounded by the G (most general) boundary; the "target" concept lies between S and G.]
Original & Seeded Version Spaces

Version spaces (Mitchell, 1980)
- Symbolic multivariate learning
- S & G sets define lattice boundaries
- Exponential worst case: O(b^N)

Seeded Version Spaces (Carbonell, 2002)
- Generality-level hypothesis seed
- S & G subsets -> effective lattice
- Polynomial worst case: O(b^(k/2)), k = 3, 4
Seeded Version Spaces (Carbonell, 2002)

[Diagram: the same S/G lattice of depth N, illustrated with an English-Spanish transfer rule: "the big book" <-> "el libro grande", mapping X_n: Det Adj N to Y_m: Det N Adj, with agreement constraints (Y2 num) = (Y3 num), (Y2 gen) = (Y3 gen), (X3 num) = (Y2 num).]
Seeded Version Spaces (continued)

[Diagram: as above, but a seed (best-guess) hypothesis is placed between the S and G boundaries, restricting the search to a sub-lattice of depth k around the seed rather than the full depth N.]
Naïve Bayes Classification

Some notation:
- Training instance index i = 1, 2, ..., I
- Term index j = 1, 2, ..., J
- Category index k = 1, 2, ..., K
- Training data D^(k) = ((x_i, y_i^(k)))
- Instance feature vector x_i = (1, n_i1, n_i2, ..., n_iJ)
- Output labels y_i = (y_i^(1), y_i^(2), ..., y_i^(K)), with y_i^(k) = 1 or 0
Bayes Classifier

Assigning the most probable category to x:

$$\hat{c} = \arg\max_k P(c_k \mid x) = \arg\max_k \frac{P(c_k) \, P(x \mid c_k)}{P(x)} \quad \text{(Bayes rule)}$$
$$= \arg\max_k P(c_k) \, P(x \mid c_k) = \arg\max_k \left[ \log P(c_k) + \log P(x \mid c_k) \right]$$

$$\hat{P}(c_k) = \frac{\#\text{ of training instances in } c_k}{I} \quad \text{(MLE)}$$

$$\hat{P}(x_i \mid c_k) = \hat{P}(n_{i1}, \dots, n_{iJ} \mid c_k) = \, ? \quad \text{(multinomial distribution)}$$
Maximum Likelihood Estimate (MLE)

n: # of objects in a random sample from a population
m: # of instances of a category among the n-object sample
p: true probability of any object belonging to the category

The likelihood of observing the data given model p is defined as:

$$L(D \mid p) = P(D \mid p) = P(Y_1, \dots, Y_n \mid p) = \prod_i P(Y_i \mid p) = p^m (1-p)^{n-m}$$

where Y_i in {0, 1}, Y_i ~ Ber(p), assuming i.i.d. samples. Taking the logarithm:

$$f(p) = \log\left( p^m (1-p)^{n-m} \right) = m \log p + (n-m) \log(1-p)$$

Setting the derivative of f(p) to zero yields:

$$\frac{d}{dp} f(p) = \frac{m}{p} - \frac{n-m}{1-p} = 0 \;\Rightarrow\; m(1-p) = (n-m)\,p \;\Rightarrow\; p = \frac{m}{n}$$
Binomial Distribution

Consider a coin toss as a Bernoulli process, X ~ Ber(p):
$$P(Head) = p, \qquad P(Tail) = 1 - p = q$$

What is the probability of seeing 2 heads out of 5 tosses?
$$P(\#\text{ of heads is } 2 \mid 5) = \binom{5}{2} p^2 q^3 = \frac{5!}{2! \, 3!} \, p^2 q^3$$

Observing k heads in n tosses follows a binomial distribution:
$$Y = \sum_{i=1}^{n} Y_i, \qquad Y \sim Bin(n, p), \qquad P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
Multinomial Distribution

Consider tossing a 6-faced die n times with probabilities p_1, p_2, ..., p_6, where the probabilities sum to 1.

Letting the count of observing each face be a random variable, we have a multinomial process defined as:

$$(X_1, X_2, \dots, X_6) \sim Mul(n, p_1, p_2, \dots, p_6), \qquad \sum_{j=1}^{6} X_j = n, \quad X_j \ge 0$$

$$P(X_1 = n_1, \dots, X_6 = n_6) = \frac{n!}{n_1! \cdots n_6!} \, p_1^{n_1} \cdots p_6^{n_6}$$
Multinomial NB

The conditional probability is:

$$P(x \mid c) = P(n_{x1}, \dots, n_{xJ} \mid c) = \frac{n_x!}{n_{x1}! \, n_{x2}! \cdots n_{xJ}!} \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}} \qquad (t_j \text{ is a term})$$

We can remove the first (multinomial-coefficient) term from the objective function, since it does not depend on the category:

$$\log P(x \mid c) \propto \sum_{j=1}^{J} n_{xj} \log P(t_j \mid c)$$
Smoothing Methods

Laplace smoothing (common):
$$\tilde{P}(t \mid c) = \frac{n_{ct} + 1}{|V| + \sum_{t' \in V} n_{ct'}}$$

Two-state Hidden Markov Model (BBN, or Jelinek-Mercer interpolation):
$$\tilde{P}(t \mid c) = \lambda P(t \mid c) + (1 - \lambda) P(t)$$

Hierarchical smoothing (McCallum, ICML '98):
$$\tilde{P}(t \mid c) = \lambda_1 P(t \mid c_1) + \lambda_2 P(t \mid c_2) + \dots + \lambda_h P(t \mid c_h)$$

The λ's (summing to 1) are the mixture weights, obtained by running an EM algorithm on a validation set.
Basic Assumptions

Term independence:
$$P(x_i \mid c_k) = P(t_1 \mid c_k)^{n_{i1}} \, P(t_2 \mid c_k)^{n_{i2}} \cdots$$

Expecting one objective attribute y per instance:
$$\sum_k P(c_k) = 1$$

Continuity of instances in the same class (one mode per class):
$$\arg\max_k P(d \mid c_k) = \arg\max_k \frac{n_d!}{n_{d1}! \, n_{d2}! \cdots n_{d|V|}!} \prod_{t \in V} P(t \mid c_k)^{n_{dt}}$$
NB and Cross Entropy

Entropy
- Measures the uncertainty; lower entropy means easier predictions
- Minimum coding length if the distribution p is known

$$H(p) = -\sum_{k=1}^{K} p_k \log p_k, \qquad p = (p_1, p_2, \dots, p_K), \quad \sum_k p_k = 1$$

Cross Entropy
- Measures the coding length (in # of bits) based on distribution q when the true distribution is p

$$H(p, q) = -\sum_k p_k \log q_k = -\sum_k p_k \log p_k + \sum_k p_k \log \frac{p_k}{q_k} = H(p) + D(p \,\|\, q)$$
Kullback-Leibler (KL) Divergence

- Also called "relative entropy"
- Measures the difference between two distributions
- Zero-valued if p = q
- Not symmetric: the arguments are not interchangeable

$$D(p \,\|\, q) = \sum_k p_k \log \frac{p_k}{q_k}$$
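These three quantities in code (a small sketch; distributions are plain lists of probabilities, log base 2 for bits):

    import math

    def entropy(p):
        """H(p) = -sum p_k log2 p_k (terms with p_k = 0 contribute 0)."""
        return -sum(pk * math.log2(pk) for pk in p if pk > 0)

    def cross_entropy(p, q):
        """H(p, q) = -sum p_k log2 q_k."""
        return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

    def kl_divergence(p, q):
        """D(p || q) = H(p, q) - H(p); zero iff p == q."""
        return cross_entropy(p, q) - entropy(p)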
NB & Cross Entropy (cont'd)

$$c^* = \arg\max_k \left[ \log P(c_k) + \sum_j n_{ij} \log P(t_j \mid c_k) \right]$$
$$= \arg\max_k \left[ \log P(c_k) + n_i \sum_{t_j \in x_i} \frac{n_{ij}}{n_i} \log P(t_j \mid c_k) \right]$$
$$= \arg\max_k \left[ \log P(c_k) + n_i \sum_{t_j \in x_i} \hat{P}(t_j \mid x_i) \log P(t_j \mid c_k) \right]$$
$$= \arg\max_k \left[ \log P(c_k) - n_i \, H(p_{x_i}, q_{c_k}) \right]$$
$$= \arg\min_k \left[ -\log P(c_k) + n_i \left( H(p_{x_i}) + D(p_{x_i} \,\|\, q_{c_k}) \right) \right]$$

Minimum Description Length (MDL) classifier
Concluding Remarks on NB

Pros
- Explicit probabilistic reasoning
- Relatively effective; fast online response (as an eager learner)

Cons
- The scoring function (logarithm of term probabilities) can be too sensitive to measurement errors on rare features
- The one-class-per-instance assumption imposes both theoretical and practical limitations
- Empirically weak when dealing with rare categories and large feature sets
Statistical Decision Theory

- Random input X in R^J
- Random output Y in {1, 2, ..., K}
- Prediction f(X) in {1, 2, ..., K}
- Loss function (0-1 loss for classification):
  L(y(x), f(x)) = 0 iff f(x) = y(x); L(y(x), f(x)) = 1 otherwise

Expected Prediction Error (EPE):
$$EPE = E\left[ L(Y, f(X)) \right] = E_X \sum_{k=1}^{K} L(k, f(X)) \, P(k \mid X)$$

Minimizing EPE pointwise:
$$\hat{f}(x) = \arg\min_{f(x) \in \{1,\dots,K\}} \sum_{k=1}^{K} L(k, f(x)) \, P(k \mid X = x) = \arg\min_{k \in \{1,\dots,K\}} \left[ 1 - P(k \mid X = x) \right] = \arg\max_{k \in \{1,\dots,K\}} P(k \mid X = x)$$
Selection of ML Algorithm (I)

Method          Training Data Requirements  Random Noise Tolerance  Scalability (atts + data)
Rule Induction  Sparse                      None                    Good
Decision Trees  Sparse-Dense                Some                    Excellent
Naïve Bayes     Medium-Dense                Some-Good               Medium
Regression      Medium-Dense                Some-Good               Good
kNN             Sparse-Dense                Some-Good               Good-Excellent
SVM             Medium-Dense                Some-Good               Good-Excellent
Neural Nets     Dense                       Good                    Poor-Medium
Selection of ML Algorithm (II)

Method          Quality of Prediction  Explanatory Power  Popularity of Usage
Rule Induction  Good, brittle          Very clear         Med, declining
Decision Trees  Good/category          Very clear         High, stable
Naïve Bayes     Medium/cat             Partial            Med, declining
Regression      Good/both              Partial-Poor       High, stable
kNN             Good/both              Partial-Good       Med, increasing
SVM             Very good/cat          Poor               Med, increasing
Neural Nets     Good/cat               Poor               High, declining