Modeling with Rules
Cynthia Rudin, Assistant Professor of Statistics
Massachusetts Institute of Technology
joint work with:
David Madigan (Columbia), Allison Chang and Ben Letham (MIT PhD students), Dimitris Bertsimas (MIT), Tyler McCormick (UW), Gene Kogan (Independent)
Would like predictive models that are both accurate and interpretable.
Accuracy = classification accuracy
Interpretability:
– concise: the model is small
– convincing: there are reasons behind each prediction
Decision List: Traffic jam in Boston?
fenway park=1 → 1 (correct 97/100 times)
rush_hour=0 → -1 (474/523 times)
rain=0, construction=0 → -1 (329/482 times)
Friday=1 → -1 (3/3 times)
rain=1 → 1 (452/892 times)
otherwise → -1 (10/15 times)
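Read top-down, a decision list predicts with the first rule whose condition matches. A minimal sketch of that semantics (feature names are illustrative; rule order and per-rule counts are as transcribed from the slide):

```python
# Minimal sketch of how a decision list is applied: scan rules top-down
# and return the label of the first condition that matches.
# Feature names are illustrative; per-rule accuracies from the slide
# appear in comments.

def predict_traffic(x):
    """x maps feature names to 0/1; returns 1 (jam) or -1 (no jam)."""
    if x["fenway_park"] == 1:                      # right 97/100 times
        return 1
    if x["rush_hour"] == 0:                        # right 474/523 times
        return -1
    if x["rain"] == 0 and x["construction"] == 0:  # right 329/482 times
        return -1
    if x["friday"] == 1:                           # right 3/3 times
        return -1
    if x["rain"] == 1:                             # right 452/892 times
        return 1
    return -1                                      # otherwise: 10/15 times

print(predict_traffic({"fenway_park": 1, "rush_hour": 1,
                       "rain": 0, "construction": 1, "friday": 0}))  # 1
print(predict_traffic({"fenway_park": 0, "rush_hour": 0,
                       "rain": 1, "construction": 1, "friday": 0}))  # -1
```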
Dichotomy in the State of the Art: Accuracy vs. Interpretability
– Accuracy: Support Vector Machines, Boosted Decision Trees
– Interpretability: Decision Trees
Daydreaming
• It would be nice if the whole algorithm were interpretable, OR
• We want the accuracy of SVM/Boosted Decision Trees with the interpretability of Decision Trees.
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO

References:
– Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan), COLT 2011
– A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction (McCormick, R, Madigan), Annals of Applied Statistics, forthcoming 2012
– Ordered Rules for Classification: A Discrete Optimization Approach to Associative Classification (Bertsimas, Chang, R), in progress
Association Rule Mining (Agrawal, Imielinski, Swami, 1993) & (Agrawal and Srikant, 1994)

Example rule: Construction=1 & Rain=1 → Traffic=1

15 times we saw construction and rain, and 13 out of those 15 times we also saw traffic:
Supp(construction=1 & rain=1) = 15
Supp(traffic=1 & construction=1 & rain=1) = 13
Conf(construction=1 & rain=1 → traffic=1) = 13/15
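As a sketch (not the authors' code), support and confidence can be computed directly from transactions represented as sets of items:

```python
# Illustrative sketch: support and confidence of an association rule,
# computed from a list of transactions (each transaction a set of items).

def supp(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def conf(lhs, rhs, transactions):
    """Conf(lhs -> rhs) = Supp(lhs & rhs) / Supp(lhs)."""
    return supp(lhs | rhs, transactions) / supp(lhs, transactions)

# Toy data matching the slide: 15 transactions with construction and rain,
# 13 of which also contain traffic.
transactions = ([{"construction=1", "rain=1", "traffic=1"}] * 13
                + [{"construction=1", "rain=1"}] * 2)

lhs, rhs = {"construction=1", "rain=1"}, {"traffic=1"}
print(supp(lhs, transactions))        # 15
print(supp(lhs | rhs, transactions))  # 13
print(conf(lhs, rhs, transactions))   # 13/15 = 0.866...
```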
"Max Confidence, Min Support" Algorithm
Step 1. Find all rules a → b where Supp(a) ≥ θ.
Step 2. Rank rules in descending order of Conf(a → b); recommend the right-hand side of the first rule that applies.

Supp(a)   Conf(a → b)
15        13/15 = .867
25        20/25 = .8
17        12/17 = .706
50        34/50 = .68

Resulting decision list (fragment): … → 1; rush hour=0 → -1; Friday=1 → -1; otherwise → -1
Which rule should we prefer: Conf=.99, Supp=10000 vs. Conf=1, Supp=10?
Bayesian version of the confidence:
AdjustedConf(a → b) := Supp(a & b) / (Supp(a) + K)

"Adjusted Confidence" Algorithm
Step 1. Find all rules a → b.
Step 2. Rank rules in descending order of AdjustedConf(a → b); recommend the right-hand side of the first rule that applies.

Supp(a)   AdjustedConf(a → b), K = 5
25        20/(25+5) = .67
15        13/(15+5) = .65
50        34/(50+5) = .62
17        12/(17+5) = .55

Resulting decision list (fragment): … → 1; rush hour=0 → -1; Friday=1 → 0; otherwise → -1
• Rare rules can be used
• Among rules with similar confidence, it prefers rules with higher support (Conf=.99, Supp=10000 vs. Conf=1, Supp=10)
• Larger K encourages larger support, which helps with prediction

AdjustedConf(a → b) := Supp(a & b) / (Supp(a) + K)
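A small sketch of the re-ranking effect, using the (Supp, Conf) pairs from the two slides above; K = 0 recovers plain confidence, while K = 5 promotes the higher-support rule:

```python
# Sketch of "Adjusted Confidence" ranking with the slide's numbers.
# Each rule is represented as (Supp(a), Supp(a & b)); K penalizes
# low-support rules.

def adjusted_conf(supp_a, supp_ab, K):
    return supp_ab / (supp_a + K)

rules = [(15, 13), (25, 20), (17, 12), (50, 34)]

# K = 0 is plain confidence: the (15, 13) rule ranks first (13/15 = .867).
by_conf = sorted(rules, key=lambda r: adjusted_conf(*r, 0), reverse=True)
# With K = 5, the higher-support (25, 20) rule moves up (20/30 = .67).
by_adj = sorted(rules, key=lambda r: adjusted_conf(*r, 5), reverse=True)

print(by_conf)  # [(15, 13), (25, 20), (17, 12), (50, 34)]
print(by_adj)   # [(25, 20), (15, 13), (50, 34), (17, 12)]
```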
– Humans can understand the prediction, and the algorithm
– Good for sequential event problems, where a set of events happens in a particular order
  • e.g., predicting what a customer will put next into an online shopping cart, or predicting medical symptoms in a sequence
– A larger K helps with generalization
  • algorithmic stability (pointwise hypothesis stability)
  • other learning-theoretic implications
– Performs better empirically than the Max-Conf, Min-Support classifiers in our experiments
A Learning Theory Framework for Association Rules and Sequential Events (R, Letham, Kogan, Madigan), SSRN 2011
Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan), COLT 2011
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO
Recommender Systems for Medical Conditions
Prediction based on your medical history: input a medical condition, receive recommendations for the next condition.

[Figure: example patient histories unfolding over time t (e.g., dyspepsia & epigastric pain, heartburn, depression, high blood pressure; gastroesophageal reflux, high blood pressure; heartburn, headache, dyspepsia; fungal infection, heartburn; epigastric pain, hypertension, dyspepsia), with top-3 recommendations at successive steps, e.g. (1. rhinitis, 2. dyspepsia, 3. low back pain), (1. dyspepsia, 2. high blood pressure, 3. low back pain), (1. epigastric pain, 2. heartburn, 3. high blood pressure).]
Medical Condition Prediction
Hierarchical Association Rule Model (HARM)
i = patient index; r = rule index for lhs_r → rhs_r
y_ir := Supp_i(rhs_r & lhs_r)
n_ir := Supp_i(lhs_r)

We'll model:
y_ir ~ Binomial(n_ir, p_ir)
p_ir ~ Beta(π_ir, τ_i)   (information is shared across individuals through the hierarchical prior)

Under this model,
E(p_ir | y_ir, n_ir) = (y_ir + π_ir) / (n_ir + π_ir + τ_i).

π_ir = exp(M_i' β_r + γ_i), where M ∈ R^(I×D) holds observable characteristics.
Example: π_ir = exp(β_r,0 + β_r,1 · 1_male + γ_i) = exp(β_r,1 · 1_male) · exp(β_r,0 + γ_i)

Priors:
log(τ_i) ~ Normal(0, σ_τ²)
log(β_rd) ~ Normal(μ_β, σ_β²)
log(γ_i) ~ Normal(μ_γ, σ_γ²)
diffuse uniform priors on μ_β, σ_β², σ_τ²

HARM estimates the posterior distribution (via MCMC), then ranks rules by posterior mean.
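A sketch of the conjugate update behind the ranking (illustrative values, not the paper's fitted parameters): the posterior mean shrinks a patient's observed rate toward a covariate-based prior.

```python
# Sketch of HARM's conjugate update for ranking rules:
#   E(p_ir | y_ir, n_ir) = (y_ir + pi_ir) / (n_ir + pi_ir + tau_i)
# where pi_ir = exp(M_i' beta_r + gamma_i) carries patient covariates.
# All numeric values below are made up for illustration.
import math

def prior_pi(M_i, beta_r, gamma_i):
    """pi_ir = exp(M_i' beta_r + gamma_i)."""
    return math.exp(sum(m * b for m, b in zip(M_i, beta_r)) + gamma_i)

def posterior_mean(y_ir, n_ir, pi_ir, tau_i):
    """Posterior mean of p_ir: Binomial likelihood, Beta(pi_ir, tau_i) prior."""
    return (y_ir + pi_ir) / (n_ir + pi_ir + tau_i)

pi = prior_pi(M_i=[1.0, 1.0], beta_r=[0.1, 0.3], gamma_i=-0.5)

# With little data the estimate is shrunk toward the covariate-based prior;
# with lots of data the observed rate y/n dominates.
print(posterior_mean(y_ir=2, n_ir=3, pi_ir=pi, tau_i=2.0))
print(posterior_mean(y_ir=200, n_ir=300, pi_ir=pi, tau_i=2.0))
```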
Data:
• 43,000 patient encounters
• ~2,300 patients, age > 40
• pre-existing conditions dealt with separately
• used the 25 most common conditions and the 25 least common conditions
For trials = 1:500
• Form training and test sets:
  – sample ~200 patients
  – for each patient, randomly split encounters (ordered in time t) into training and test
• For each patient, iteratively make predictions on test encounters
  – get 1 point whenever our top-3 recommendations contain the patient's next condition
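The scoring loop above can be sketched as follows (the recommender shown is a stand-in for illustration, not HARM):

```python
# Sketch of the evaluation protocol: walk through a patient's test
# encounters; score 1 point whenever the next condition appears in the
# current top-3 recommendations.

def score_patient(test_sequence, recommend_top3):
    """recommend_top3(history) returns a list of 3 predicted conditions."""
    history, points = [], 0
    for condition in test_sequence:
        if condition in recommend_top3(history):
            points += 1
        history.append(condition)
    return points / len(test_sequence)

# Stand-in recommender (illustrative): always suggests the same 3 conditions.
def top3(history):
    return ["dyspepsia", "heartburn", "hypertension"]

# 2 of the 3 test conditions are in the top-3 lists.
print(score_patient(["heartburn", "rhinitis", "dyspepsia"], top3))
```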
[Figure: (a) All patients. Boxplots of the proportion of correct predictions (y-axis 0.0 to 0.6) for HARM, Confidence, Adjusted Confidence with K = .25, .5, 1, 2, and thresholded variants (Thresh. = 2, 3).]
[Figure: Myocardial infarction in patients with hypertension, in treatment (T) and placebo (P) groups, by age band (40−50, 51−60, 61−70, over 70). Key: middle half, middle 90%, and mean of posterior means, for HARM vs. Confidence; y-axis: rescaled risk.]
[Figure: Myocardial infarction in patients with high cholesterol, in treatment (T) and placebo (P) groups, by the same age bands and with the same key (middle half, middle 90%, mean of posterior means; HARM vs. Confidence); y-axis: rescaled risk.]
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO
Mixed Integer Optimization
• MIO/MIP is a style of mathematical programming
• Not generally used for ML: a perception from the 1970s that MIOs are intractable
• Not all valid MIO formulations are equally strong
• Can use LP relaxations for very large scale problems
• Association rules have historically been plagued by "combinatorial explosion"...
Ordered Rules for Classification
• Minimize misclassification error; regularize by the height of the highest null rule.
• "Null rules": the higher one predicts the default class and ends the list.

MIO Learning Algorithm
• Maximize classification accuracy
• Maximize the rank of the highest null rule (regularization)
Experiments
• Five algorithms
  – Logistic Regression (LogReg)
  – Support Vector Machines with RBF kernel (SVM)
  – Classification and Regression Trees (CART)
  – Boosted Decision Trees (AdaBoost)
  – Ordered Rules for Classification (ORC)
• Several publicly available datasets (UCI)
• Accuracy averaged over 3 folds
Classification Accuracy
CART on Tic Tac Toe
[Figure: a CART tree for tic-tac-toe, with yes/no splits on board squares (x, o) and leaf probabilities such as .26, .47, .92.]
CART accuracy = 0.9388715
ORC on Tic Tac Toe
x
x
x
x x x
x
x
x
x
x
x
x
x
x
x
x
x
x x x
x x x
x wins
1 2 3 4 5 6 7 8
9
x wins x wins x wins x wins x wins x wins
x does not win
x wins
ORC accuracy = 1
MONKS Problems 1
• 6 integer-valued features taking values 1, 2, 3, 4
• Examples are in class 1 if either a1=a2 or a5=1
CART on MONKS Problems 1
• Examples are in class 1 if either a1=a2 or a5=1
ORC on MONKS Problems 1
• Examples are in class 1 if either a1=a2 or a5=1

a1=3, a2=3 → 1 (33/33)
a1=2, a2=2 → 1 (30/30)
a5=1 → 1 (65/65)
a1=1, a2=1 → 1 (31/31)
∅ → -1 (152/288)
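The rule list above can be applied directly as an ordered decision list. A sketch (using the standard MONK's feature domains: a1, a2, a4 ∈ {1,2,3}; a3, a6 ∈ {1,2}; a5 ∈ {1,2,3,4}), checking that it reproduces the target concept exactly, consistent with ORC's accuracy of 1:

```python
# Sketch: the ORC rule list for MONKS-1 applied as an ordered decision
# list; the first matching rule fires, otherwise the null rule applies.
import itertools

def orc_monks1(a):
    """a = (a1, ..., a6); returns +1 or -1."""
    a1, a2, _, _, a5, _ = a
    if a1 == 3 and a2 == 3:
        return 1
    if a1 == 2 and a2 == 2:
        return 1
    if a5 == 1:
        return 1
    if a1 == 1 and a2 == 1:
        return 1
    return -1  # null rule: default class

def target(a):
    """True concept: class 1 iff a1 = a2 or a5 = 1."""
    return 1 if (a[0] == a[1] or a[4] == 1) else -1

all_examples = itertools.product([1, 2, 3], [1, 2, 3], [1, 2],
                                 [1, 2, 3], [1, 2, 3, 4], [1, 2])
print(all(orc_monks1(a) == target(a) for a in all_examples))  # True
```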
• The bottom line: you don't need to sacrifice accuracy to get interpretability.
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO
Related areas (current work coming up):
• Association Rules / Associative Classification
• Decision Trees
• Decision Lists
• Logical Analysis of Data (LAD)
• Bayesian Analysis
• ML algorithms that use rules as features
Current Work
• Machine Learning for the NYC Power Grid
  – cover of IEEE Computer, spotlight issue for IEEE TPAMI in February, WIRED Science, Slashdot, US News & World Report...
• Supervised Ranking, Equivalences between Ranking and Classification, Ranking with MIO
• Reverse-Engineering Quality Rankings
  – in Businessweek last week
• ML algorithms that understand how they will be used for a subsequent task
• Several other projects
Thank you!