Modeling with Rules
Cynthia Rudin, Assistant Professor of Statistics
Massachusetts Institute of Technology
joint work with:
David Madigan (Columbia), Allison Chang and Ben Letham (MIT PhD students), Dimitris Bertsimas (MIT), Tyler McCormick (UW), Gene Kogan (Independent)
Would like predictive models that are both accurate and interpretable.
Accuracy = classification accuracy
Interpretability:
– concise: the model is small
– convincing: there are reasons behind each prediction
Decision List: Traffic jam in Boston?
fenway park=1 → 1 (correct 97/100 times)
rush_hour=0 → -1 (474/523 times)
rain=0, construction=0 → -1 (329/482 times)
Friday=1 → -1 (3/3 times)
rain=1 → 1 (452/892 times)
otherwise → -1 (10/15 times)
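Read top-down, a decision list predicts with the first rule whose condition matches. A minimal sketch of that semantics (feature names are illustrative; rule order and per-rule counts are as transcribed from the slide):

```python
# Minimal sketch of how a decision list is applied: scan rules top-down
# and return the label of the first condition that matches.
# Feature names are illustrative; per-rule accuracies from the slide
# appear in comments.

def predict_traffic(x):
    """x maps feature names to 0/1; returns 1 (jam) or -1 (no jam)."""
    if x["fenway_park"] == 1:                      # right 97/100 times
        return 1
    if x["rush_hour"] == 0:                        # right 474/523 times
        return -1
    if x["rain"] == 0 and x["construction"] == 0:  # right 329/482 times
        return -1
    if x["friday"] == 1:                           # right 3/3 times
        return -1
    if x["rain"] == 1:                             # right 452/892 times
        return 1
    return -1                                      # otherwise: 10/15 times

print(predict_traffic({"fenway_park": 1, "rush_hour": 1,
                       "rain": 0, "construction": 1, "friday": 0}))  # 1
print(predict_traffic({"fenway_park": 0, "rush_hour": 0,
                       "rain": 1, "construction": 1, "friday": 0}))  # -1
```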
Dichotomy in the State of the Art: Accuracy vs. Interpretability
– Accuracy: Support Vector Machines, Boosted Decision Trees
– Interpretability: Decision Trees
Daydreaming
• It would be nice if the whole algorithm were interpretable, OR
• We want the accuracy of SVM/Boosted Decision Trees with the interpretability of Decision Trees.
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO

References:
– Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan), COLT 2011
– A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction (McCormick, R, Madigan), Annals of Applied Statistics, forthcoming 2012
– Ordered Rules for Classification: A Discrete Optimization Approach to Associative Classification (Bertsimas, Chang, R), in progress
Association Rule Mining (Agrawal, Imielinski, Swami, 1993) & (Agrawal and Srikant, 1994)

Example rule: Construction=1 & Rain=1 → Traffic=1

15 times we saw construction and rain, and 13 out of those 15 times we also saw traffic:
Supp(construction=1 & rain=1) = 15
Supp(traffic=1 & construction=1 & rain=1) = 13
Conf(construction=1 & rain=1 → traffic=1) = 13/15
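As a sketch (not the authors' code), support and confidence can be computed directly from transactions represented as sets of items:

```python
# Illustrative sketch: support and confidence of an association rule,
# computed from a list of transactions (each transaction a set of items).

def supp(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def conf(lhs, rhs, transactions):
    """Conf(lhs -> rhs) = Supp(lhs & rhs) / Supp(lhs)."""
    return supp(lhs | rhs, transactions) / supp(lhs, transactions)

# Toy data matching the slide: 15 transactions with construction and rain,
# 13 of which also contain traffic.
transactions = ([{"construction=1", "rain=1", "traffic=1"}] * 13
                + [{"construction=1", "rain=1"}] * 2)

lhs, rhs = {"construction=1", "rain=1"}, {"traffic=1"}
print(supp(lhs, transactions))        # 15
print(supp(lhs | rhs, transactions))  # 13
print(conf(lhs, rhs, transactions))   # 13/15 = 0.866...
```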
"Max Confidence, Min Support" Algorithm
Step 1. Find all rules a → b where Supp(a) ≥ θ.
Step 2. Rank rules in descending order of Conf(a → b); recommend the right-hand side of the first rule that applies.

Supp(a)   Conf(a → b)
15        13/15 = .867
25        20/25 = .8
17        12/17 = .706
50        34/50 = .68

Resulting decision list (fragment): … → 1; rush hour=0 → -1; Friday=1 → -1; otherwise → -1
Which rule should we prefer: Conf=.99, Supp=10000 vs. Conf=1, Supp=10?
Bayesian version of the confidence:
AdjustedConf(a → b) := Supp(a & b) / (Supp(a) + K)

"Adjusted Confidence" Algorithm
Step 1. Find all rules a → b.
Step 2. Rank rules in descending order of AdjustedConf(a → b); recommend the right-hand side of the first rule that applies.

Supp(a)   AdjustedConf(a → b), K = 5
25        20/(25+5) = .67
15        13/(15+5) = .65
50        34/(50+5) = .62
17        12/(17+5) = .55

Resulting decision list (fragment): … → 1; rush hour=0 → -1; Friday=1 → 0; otherwise → -1
• Rare rules can be used
• Among rules with similar confidence, it prefers rules with higher support (Conf=.99, Supp=10000 vs. Conf=1, Supp=10)
• Larger K encourages larger support, which helps with prediction

AdjustedConf(a → b) := Supp(a & b) / (Supp(a) + K)
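A small sketch of the re-ranking effect, using the (Supp, Conf) pairs from the two slides above; K = 0 recovers plain confidence, while K = 5 promotes the higher-support rule:

```python
# Sketch of "Adjusted Confidence" ranking with the slide's numbers.
# Each rule is represented as (Supp(a), Supp(a & b)); K penalizes
# low-support rules.

def adjusted_conf(supp_a, supp_ab, K):
    return supp_ab / (supp_a + K)

rules = [(15, 13), (25, 20), (17, 12), (50, 34)]

# K = 0 is plain confidence: the (15, 13) rule ranks first (13/15 = .867).
by_conf = sorted(rules, key=lambda r: adjusted_conf(*r, 0), reverse=True)
# With K = 5, the higher-support (25, 20) rule moves up (20/30 = .67).
by_adj = sorted(rules, key=lambda r: adjusted_conf(*r, 5), reverse=True)

print(by_conf)  # [(15, 13), (25, 20), (17, 12), (50, 34)]
print(by_adj)   # [(25, 20), (15, 13), (50, 34), (17, 12)]
```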
– Humans can understand the prediction, and the algorithm
– Good for sequential event problems, where a set of events happens in a particular order
  • e.g., predicting what a customer will put next into an online shopping cart, or predicting medical symptoms in a sequence
– A larger K helps with generalization
  • algorithmic stability (pointwise hypothesis stability)
  • other learning-theoretic implications
– Performs better empirically than the Max-Conf, Min-Support classifiers in our experiments
A Learning Theory Framework for Association Rules and Sequential Events (R, Letham, Kogan, Madigan), SSRN 2011
Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan), COLT 2011
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO
Recommender Systems for Medical Conditions
Prediction based on your medical history: input a medical condition, receive recommendations for the next condition.

[Figure: example patient histories unfolding over time t (e.g., dyspepsia & epigastric pain, heartburn, depression, high blood pressure; gastroesophageal reflux, high blood pressure; heartburn, headache, dyspepsia; fungal infection, heartburn; epigastric pain, hypertension, dyspepsia), with top-3 recommendations at successive steps, e.g. (1. rhinitis, 2. dyspepsia, 3. low back pain), (1. dyspepsia, 2. high blood pressure, 3. low back pain), (1. epigastric pain, 2. heartburn, 3. high blood pressure).]
Medical Condition Prediction
Hierarchical Association Rule Model (HARM)
i = patient index; r = rule index for lhs_r → rhs_r
y_ir := Supp_i(rhs_r & lhs_r)
n_ir := Supp_i(lhs_r)

We'll model:
y_ir ~ Binomial(n_ir, p_ir)
p_ir ~ Beta(π_ir, τ_i)   (information is shared across individuals through the hierarchical prior)

Under this model,
E(p_ir | y_ir, n_ir) = (y_ir + π_ir) / (n_ir + π_ir + τ_i).

π_ir = exp(M_i' β_r + γ_i), where M ∈ R^(I×D) holds observable characteristics.
Example: π_ir = exp(β_r,0 + β_r,1 · 1_male + γ_i) = exp(β_r,1 · 1_male) · exp(β_r,0 + γ_i)

Priors:
log(τ_i) ~ Normal(0, σ_τ²)
log(β_rd) ~ Normal(μ_β, σ_β²)
log(γ_i) ~ Normal(μ_γ, σ_γ²)
diffuse uniform priors on μ_β, σ_β², σ_τ²

HARM estimates the posterior distribution (via MCMC), then ranks rules by posterior mean.
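A sketch of the conjugate update behind the ranking (illustrative values, not the paper's fitted parameters): the posterior mean shrinks a patient's observed rate toward a covariate-based prior.

```python
# Sketch of HARM's conjugate update for ranking rules:
#   E(p_ir | y_ir, n_ir) = (y_ir + pi_ir) / (n_ir + pi_ir + tau_i)
# where pi_ir = exp(M_i' beta_r + gamma_i) carries patient covariates.
# All numeric values below are made up for illustration.
import math

def prior_pi(M_i, beta_r, gamma_i):
    """pi_ir = exp(M_i' beta_r + gamma_i)."""
    return math.exp(sum(m * b for m, b in zip(M_i, beta_r)) + gamma_i)

def posterior_mean(y_ir, n_ir, pi_ir, tau_i):
    """Posterior mean of p_ir: Binomial likelihood, Beta(pi_ir, tau_i) prior."""
    return (y_ir + pi_ir) / (n_ir + pi_ir + tau_i)

pi = prior_pi(M_i=[1.0, 1.0], beta_r=[0.1, 0.3], gamma_i=-0.5)

# With little data the estimate is shrunk toward the covariate-based prior;
# with lots of data the observed rate y/n dominates.
print(posterior_mean(y_ir=2, n_ir=3, pi_ir=pi, tau_i=2.0))
print(posterior_mean(y_ir=200, n_ir=300, pi_ir=pi, tau_i=2.0))
```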
Data:
• 43,000 patient encounters
• ~2,300 patients, age > 40
• pre-existing conditions dealt with separately
• used the 25 most common conditions and the 25 least common conditions
For trials = 1:500
• Form training and test sets:
  – sample ~200 patients
  – for each patient, randomly split encounters (ordered in time t) into training and test
• For each patient, iteratively make predictions on test encounters
  – get 1 point whenever our top-3 recommendations contain the patient's next condition
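The scoring loop above can be sketched as follows (the recommender shown is a stand-in for illustration, not HARM):

```python
# Sketch of the evaluation protocol: walk through a patient's test
# encounters; score 1 point whenever the next condition appears in the
# current top-3 recommendations.

def score_patient(test_sequence, recommend_top3):
    """recommend_top3(history) returns a list of 3 predicted conditions."""
    history, points = [], 0
    for condition in test_sequence:
        if condition in recommend_top3(history):
            points += 1
        history.append(condition)
    return points / len(test_sequence)

# Stand-in recommender (illustrative): always suggests the same 3 conditions.
def top3(history):
    return ["dyspepsia", "heartburn", "hypertension"]

# 2 of the 3 test conditions are in the top-3 lists.
print(score_patient(["heartburn", "rhinitis", "dyspepsia"], top3))
```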
[Figure: (a) All patients. Boxplots of the proportion of correct predictions (y-axis 0.0 to 0.6) for HARM, Confidence, Adjusted Confidence with K = .25, .5, 1, 2, and thresholded variants (Thresh. = 2, 3).]
[Figure: Myocardial infarction in patients with hypertension, in treatment (T) and placebo (P) groups, by age band (40−50, 51−60, 61−70, over 70). Key: middle half, middle 90%, and mean of posterior means, for HARM vs. Confidence; y-axis: rescaled risk.]
[Figure: Myocardial infarction in patients with high cholesterol, in treatment (T) and placebo (P) groups, by the same age bands and with the same key (middle half, middle 90%, mean of posterior means; HARM vs. Confidence); y-axis: rescaled risk.]
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO
Mixed Integer Optimization
• MIO/MIP is a style of mathematical programming
• Not generally used for ML: a perception from the 1970s that MIOs are intractable
• Not all valid MIO formulations are equally strong
• Can use LP relaxations for very large scale problems
• Association rules have historically been plagued by "combinatorial explosion"...
Ordered Rules for Classification
• Minimize misclassification error; regularize by the height of the highest null rule.
• "Null rules": the higher one predicts the default class and ends the list.

MIO Learning Algorithm
• Maximize classification accuracy
• Maximize the rank of the highest null rule (regularization)
Experiments
• Five algorithms
  – Logistic Regression (LogReg)
  – Support Vector Machines with RBF kernel (SVM)
  – Classification and Regression Trees (CART)
  – Boosted Decision Trees (AdaBoost)
  – Ordered Rules for Classification (ORC)
• Several publicly available datasets (UCI)
• Accuracy averaged over 3 folds
Classification Accuracy
CART on Tic Tac Toe
[Figure: a CART tree for tic-tac-toe, with yes/no splits on board squares (x, o) and leaf probabilities such as .26, .47, .92.]
CART accuracy = 0.9388715
ORC on Tic Tac Toe
x
x
x
x x x
x
x
x
x
x
x
x
x
x
x
x
x
x x x
x x x
x wins
1 2 3 4 5 6 7 8
9
x wins x wins x wins x wins x wins x wins
x does not win
x wins
ORC accuracy = 1
MONKS Problems 1
• 6 integer-valued features taking values 1, 2, 3, 4
• Examples are in class 1 if either a1=a2 or a5=1
CART on MONKS Problems 1
• Examples are in class 1 if either a1=a2 or a5=1
ORC on MONKS Problems 1
• Examples are in class 1 if either a1=a2 or a5=1

a1=3, a2=3 → 1 (33/33)
a1=2, a2=2 → 1 (30/30)
a5=1 → 1 (65/65)
a1=1, a2=1 → 1 (31/31)
∅ → -1 (152/288)
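The rule list above can be applied directly as an ordered decision list. A sketch (using the standard MONK's feature domains: a1, a2, a4 ∈ {1,2,3}; a3, a6 ∈ {1,2}; a5 ∈ {1,2,3,4}), checking that it reproduces the target concept exactly, consistent with ORC's accuracy of 1:

```python
# Sketch: the ORC rule list for MONKS-1 applied as an ordered decision
# list; the first matching rule fires, otherwise the null rule applies.
import itertools

def orc_monks1(a):
    """a = (a1, ..., a6); returns +1 or -1."""
    a1, a2, _, _, a5, _ = a
    if a1 == 3 and a2 == 3:
        return 1
    if a1 == 2 and a2 == 2:
        return 1
    if a5 == 1:
        return 1
    if a1 == 1 and a2 == 1:
        return 1
    return -1  # null rule: default class

def target(a):
    """True concept: class 1 iff a1 = a2 or a5 = 1."""
    return 1 if (a[0] == a[1] or a[4] == 1) else -1

all_examples = itertools.product([1, 2, 3], [1, 2, 3], [1, 2],
                                 [1, 2, 3], [1, 2, 3, 4], [1, 2])
print(all(orc_monks1(a) == target(a) for a in all_examples))  # True
```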
• The bottom line: you don't need to sacrifice accuracy to get interpretability.
Outline
• Part 1: Humans can interpret the predictions, and understand the full algorithm
• Part 2: Bayesian hierarchical modeling with rules
• Part 3: Accurate rule classifiers using MIO
Related areas (current work coming up):
• Association Rules / Associative Classification
• Decision Trees
• Decision Lists
• Logical Analysis of Data (LAD)
• Bayesian Analysis
• ML algorithms that use rules as features
Current Work
• Machine Learning for the NYC Power Grid
  – cover of IEEE Computer, spotlight issue for IEEE TPAMI in February, WIRED Science, Slashdot, US News & World Report...
• Supervised Ranking, Equivalences between Ranking and Classification, Ranking with MIO
• Reverse-Engineering Quality Rankings
  – in Businessweek last week
• ML algorithms that understand how they will be used for a subsequent task
• Several other projects
Thank you!