
Page 1

Data Mining: An Overview

David Madigan

dmadigan@rci.rutgers.edu

http://stat.rutgers.edu/~madigan

Page 2

Overview

• Brief Introduction to Data Mining
• Data Mining Algorithms
• Specific Examples
  – Algorithms: Disease Clusters
  – Algorithms: Model-Based Clustering
  – Algorithms: Frequent Items and Association Rules
• Future Directions, etc.

Page 3

Of “Laws”, Monsters, and Giants…

• Moore’s law: processing “capacity” doubles every 18 months: CPU, cache, memory
• Its more aggressive cousin:
  – Disk storage “capacity” doubles every 9 months

[Chart: Disk TB shipped per year, 1988–2000, log scale (1E+3 to 1E+7 TB, "ExaByte" marked); disk TB growth: 112%/y vs. Moore's Law: 58.7%/y. Source: 1998 Disk/Trend, Jim Porter, http://www.disktrend.com/pdf/portrpkg.pdf]

What do the two “laws” combined produce?

A rapidly growing gap between our ability to generate data, and our ability to make use of it.

Page 4

What is Data Mining?

Finding interesting structure in data

• Structure: refers to statistical patterns, predictive models, hidden relationships

• Examples of tasks addressed by Data Mining
  – Predictive Modeling (classification, regression)
  – Segmentation (Data Clustering)
  – Summarization
  – Visualization

Page 5
Page 6
Page 7

Ronny Kohavi, ICML 1998

Page 8

Ronny Kohavi, ICML 1998

Page 9

Ronny Kohavi, ICML 1998

Page 10

Stories: Online Retailing

Page 11

Chapter 4: Data Analysis and Uncertainty

•Elementary statistical concepts: random variables, distributions, densities, independence, point and interval estimation, bias & variance, MLE

•Model (global, represents prominent structure) vs. Pattern (local, idiosyncratic deviations)

•Frequentist vs. Bayesian

•Sampling methods

Page 12

Bayesian Estimation

e.g. beta-binomial model:

$$p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta) \propto \theta^{r}(1-\theta)^{n-r}\cdot\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{r+\alpha-1}(1-\theta)^{n-r+\beta-1}$$

Predictive distribution:

$$p(x(n+1) \mid D) = \int p(x(n+1), \theta \mid D)\,d\theta = \int p(x(n+1) \mid \theta)\,p(\theta \mid D)\,d\theta$$
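As a quick numerical check of the conjugate update above, here is a minimal Python sketch; the Beta(1, 1) prior and the counts n = 20, r = 14 are made-up illustrative values, not from the lecture.

    # Beta-binomial: posterior is Beta(alpha + r, beta + n - r)
    from scipy import stats

    alpha, beta = 1.0, 1.0   # uniform Beta(1,1) prior (assumed)
    n, r = 20, 14            # hypothetical data: 14 successes in 20 trials

    posterior = stats.beta(alpha + r, beta + n - r)
    print("posterior mean:", posterior.mean())

    # Predictive probability the next observation is a success:
    # the integral above reduces to the posterior mean
    p_next = (alpha + r) / (alpha + beta + n)
    print("P(x(n+1) = 1 | D) =", p_next)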

Page 13

Issues to do with p-values

•Using thresholds of 0.05 or 0.01 regardless of sample size

•Multiple testing (e.g. Friedman (1983) selecting highly significant regressors from noise)

•Subtle interpretation. Jeffreys (1980): “I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened.”

Page 14

p-value as measure of evidence

Schervish (1996): “if hypothesis H implies hypothesis H', then there should be at least as much support for H' as for H.”

- not satisfied by p-values

Grimmet and Ridenhour (1996): “one might expect an outlying data point to lend support to the alternative hypothesis in, for instance, a one-way analysis of variance.”

- the value of the outlying data point that minimizes the significance level can lie within the range of the data

Page 15

Chapter 5: Data Mining Algorithms

“A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns”

“well-defined”: can be encoded in software

“algorithm”: must terminate after some finite number of steps

Page 16

Data Mining Algorithms

“A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns”

“well-defined”: can be encoded in software

“algorithm”: must terminate after some finite number of steps

Hand, Mannila, and Smyth

Page 17

Algorithm Components

1. The task the algorithm is used to address (e.g. classification, clustering, etc.)

2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)

3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)

4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)

5. The data management technique used for storing, indexing, and retrieving data (critical when data too large to reside in memory)

Page 18
Page 19

Backpropagation data mining algorithm

[Diagram: feed-forward network with inputs x1–x4, hidden units h1, h2, and output y]

•vector of p input values multiplied by p × d1 weight matrix

•resulting d1 values individually transformed by non-linear function

•resulting d1 values multiplied by d1 × d2 weight matrix

$$s_1 = \sum_{i=1}^{4} \alpha_{i1} x_i \,; \qquad s_2 = \sum_{i=1}^{4} \alpha_{i2} x_i$$

$$h_i = \frac{1}{1 + e^{-s_i}}$$

$$\hat{y} = \sum_{i=1}^{2} w_i h_i$$

Page 20

Backpropagation (cont.)

Parameters: $\alpha_{11}, \ldots, \alpha_{42}, w_1, w_2$

Score:

$$S_{SSE} = \sum_{i=1}^{n} \big(y(i) - \hat{y}(i)\big)^2$$

Search: steepest descent; search for structure?
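The slide's components map directly onto a few lines of NumPy. This is a minimal sketch of the 4-2-1 network, the SSE score, and steepest descent; the synthetic data, learning rate, and iteration count are illustrative assumptions, not values from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))          # n=100 cases, p=4 inputs (simulated)
    y = rng.normal(size=100)               # synthetic targets

    A = rng.normal(scale=0.1, size=(4, 2)) # alpha: 4x2 input-to-hidden weights
    w = rng.normal(scale=0.1, size=2)      # hidden-to-output weights
    lr = 0.01                              # steepest-descent step size (assumed)

    def forward(X, A, w):
        s = X @ A                          # s_j = sum_i alpha_ij x_i
        h = 1.0 / (1.0 + np.exp(-s))       # logistic transform of each s_j
        return h, h @ w                    # y_hat = sum_j w_j h_j

    for step in range(1000):
        h, y_hat = forward(X, A, w)
        resid = y_hat - y                  # dS_SSE / dy_hat = 2 * resid
        grad_w = 2 * h.T @ resid
        grad_A = 2 * X.T @ ((resid[:, None] * w) * h * (1 - h))
        w -= lr * grad_w
        A -= lr * grad_A

    print("S_SSE:", np.sum((y - forward(X, A, w)[1]) ** 2))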

Page 21

Models and Patterns

Models

Prediction
Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

Page 22

Models

Prediction
Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparametric regression

Page 23
Page 24

Models

Prediction
Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparametric regression

•Classification

logistic regression

naïve Bayes/TAN/Bayesian networks

NN

support vector machines

Trees

etc.

Page 25

Models

Prediction
Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparametric regression

•Classification

•Parametric models

•Mixtures of parametric models

•Graphical Markov models (categorical, continuous, mixed)

Page 26

Models

Prediction
Probability Distributions

Structured Data

•Linear regression

•Piecewise linear

•Nonparametric regression

•Classification

•Parametric models

•Mixtures of parametric models

•Graphical Markov models (categorical, continuous, mixed)

•Time series

•Markov models

•Mixture Transition Distribution models

•Hidden Markov models

•Spatial models

Page 27

Markov Models

First-order:

$$p(y_1, \ldots, y_T) = p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1})$$

e.g.:

$$p(y_t \mid y_{t-1}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}\big(y_t - g(y_{t-1})\big)^2\right)$$

g linear gives the standard first-order auto-regressive model: $y_t = a_0 + a_1 y_{t-1} + e_t$, $e_t \sim N(0, \sigma^2)$

[Diagram: chain graph y1 → y2 → y3 → … → yT]
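A short sketch connecting the factorization to the AR(1) example: simulate y_t = a0 + a1*y_{t-1} + e_t and evaluate the conditional (first-order Markov) log-likelihood. The parameter values are assumptions for illustration.

    import numpy as np
    from scipy import stats

    a0, a1, sigma, T = 0.5, 0.8, 1.0, 200   # illustrative parameters
    rng = np.random.default_rng(1)

    y = np.empty(T)
    y[0] = rng.normal()
    for t in range(1, T):
        y[t] = a0 + a1 * y[t - 1] + rng.normal(scale=sigma)

    # log p(y_1,...,y_T) = log p(y_1) + sum_t log p(y_t | y_{t-1});
    # treat y_1 as fixed (conditional likelihood) for simplicity
    loglik = stats.norm.logpdf(y[1:], loc=a0 + a1 * y[:-1], scale=sigma).sum()
    print("conditional log-likelihood:", loglik)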

Page 28

First-Order HMM/Kalman Filter

[Diagram: hidden chain x1 → x2 → x3 → … → xT, each x_t emitting an observation y_t]

$$p(y_1, \ldots, y_T, x_1, \ldots, x_T) = p(x_1)\,p(y_1 \mid x_1)\prod_{t=2}^{T} p(x_t \mid x_{t-1})\,p(y_t \mid x_t)$$

Note: to compute p(y1,…,yT) need to sum/integrate over all possible state sequences...
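That sum over all state sequences is exactly what the forward recursion computes, in O(T K²) time rather than O(K^T). A minimal discrete-HMM sketch, with made-up initial, transition, and emission probabilities:

    import numpy as np

    pi = np.array([0.6, 0.4])              # p(x_1): K = 2 hidden states (assumed)
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])             # p(x_t | x_{t-1})
    E = np.array([[0.7, 0.3],
                  [0.1, 0.9]])             # p(y_t | x_t): 2 observable symbols

    def forward_loglik(obs):
        """log p(y_1,...,y_T) via the scaled forward recursion."""
        alpha = pi * E[:, obs[0]]          # alpha_1(k) = p(x_1 = k) p(y_1 | k)
        loglik = np.log(alpha.sum())
        alpha /= alpha.sum()
        for y in obs[1:]:
            alpha = (alpha @ P) * E[:, y]  # predict one step, weight by emission
            loglik += np.log(alpha.sum())  # accumulate scaling constants
            alpha /= alpha.sum()
        return loglik

    print(forward_loglik([0, 1, 1, 0, 1]))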

Page 29

Bias-Variance Tradeoff

High Bias - Low Variance

Low Bias - High Variance

“overfitting” - modeling the random component

Score function should embody the compromise

Page 30

The Curse of Dimensionality

$X \sim \text{MVN}_p(0, I)$

•Gaussian kernel density estimation

•Bandwidth chosen to minimize MSE at the mean

•Suppose we want: $\dfrac{E\big[(\hat{p}(x) - p(x))^2\big]}{p(x)^2} \le 0.01$ at $x = 0$

Dimension    # data points
1            4
2            19
3            67
6            2,790
10           842,000
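A small simulation in the spirit of the table: with the sample size held fixed, the relative MSE of a Gaussian kernel density estimate at the mean grows rapidly with dimension. The sample size and scipy's default bandwidth rule are assumptions, so the numbers will not match the calibrated table above.

    import numpy as np
    from scipy.stats import gaussian_kde, multivariate_normal

    n, reps = 500, 50                             # assumed, not the table's n
    rng = np.random.default_rng(2)
    for p in (1, 2, 3, 6):
        true_p = multivariate_normal(np.zeros(p), np.eye(p)).pdf(np.zeros(p))
        ests = []
        for _ in range(reps):
            X = rng.normal(size=(p, n))           # gaussian_kde wants (dim, n)
            ests.append(gaussian_kde(X)(np.zeros((p, 1)))[0])
        rel_mse = np.mean((np.array(ests) - true_p) ** 2) / true_p ** 2
        print(f"p={p}: relative MSE at the mean ~ {rel_mse:.3f}")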

Page 31

Patterns

Global:
•Clustering via partitioning
•Hierarchical Clustering
•Mixture Models

Local:
•Outlier detection
•Changepoint detection
•Bump hunting
•Scan statistics
•Association rules

Page 32

Scan Statistics via Permutation Tests

[Figure: a curve representing a road; each “x” marks an accident, red “x” = injury accident, black “x” = no injury]

Is there a stretch of road where there is an unusually large fraction of injury accidents?

Page 33

Scan with Fixed Window

• If we know the length of the “stretch of road” that we seek, we could slide a window of that length along the road and find the most “unusual” window location

[Figure: the same road, with a fixed-length window slid along it]

Page 34

How Unusual is a Window?

• Let p_W and p_¬W denote the true probability of being red inside and outside the window respectively. Let (x_W, n_W) and (x_¬W, n_¬W) denote the corresponding counts

• Use the GLRT for comparing H0: p_W = p_¬W versus H1: p_W ≠ p_¬W

$$\lambda = \frac{\left(\frac{x_W + x_{\neg W}}{n_W + n_{\neg W}}\right)^{x_W + x_{\neg W}}\left(1 - \frac{x_W + x_{\neg W}}{n_W + n_{\neg W}}\right)^{(n_W + n_{\neg W})-(x_W + x_{\neg W})}}{\left(\frac{x_W}{n_W}\right)^{x_W}\left(1 - \frac{x_W}{n_W}\right)^{n_W - x_W}\left(\frac{x_{\neg W}}{n_{\neg W}}\right)^{x_{\neg W}}\left(1 - \frac{x_{\neg W}}{n_{\neg W}}\right)^{n_{\neg W} - x_{\neg W}}}$$

−2 log λ here has an asymptotic chi-square distribution with 1 df

• λ measures how unusual a window is
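A direct sketch of λ for a single window. Writing both numerator and denominator with binomial pmfs lets the combinatorial terms cancel; the counts passed in at the bottom are hypothetical.

    import numpy as np
    from scipy import stats

    def window_lambda(x_w, n_w, x_out, n_out):
        """Binomial GLRT: pooled rate (H0) vs. separate rates (H1)."""
        p_pool = (x_w + x_out) / (n_w + n_out)
        num = (stats.binom.logpmf(x_w, n_w, p_pool)
               + stats.binom.logpmf(x_out, n_out, p_pool))
        den = (stats.binom.logpmf(x_w, n_w, x_w / n_w)
               + stats.binom.logpmf(x_out, n_out, x_out / n_out))
        return np.exp(num - den)   # small lambda = unusual window

    lam = window_lambda(x_w=8, n_w=10, x_out=12, n_out=50)  # made-up counts
    print("lambda:", lam, " -2 log lambda:", -2 * np.log(lam))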

Page 35

Permutation Test

• Since we look at the smallest λ over all window locations, we need to find the distribution of the smallest λ under the null hypothesis that there are no clusters

• Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x’s (each relabelling yields its own smallest λ, e.g. 0.376, 0.233, 0.412, 0.222, …)

• Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is 5th smallest, the p-value is 0.005)
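Putting the pieces together, a compact sketch of the full procedure on simulated data, reusing window_lambda from the sketch above; the road length, window length, and planted cluster are all assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    n_acc, win = 60, 10
    injury = rng.random(n_acc) < 0.2       # red x's along the road (simulated)
    injury[25:32] = True                   # plant a cluster

    def smallest_lambda(labels):
        best = np.inf
        total_x, total_n = labels.sum(), len(labels)
        for i in range(total_n - win + 1):
            x_w = labels[i:i + win].sum()
            best = min(best,
                       window_lambda(x_w, win, total_x - x_w, total_n - win))
        return best

    obs = smallest_lambda(injury)
    # permuting the labels preserves the total count under the null
    null = [smallest_lambda(rng.permutation(injury)) for _ in range(999)]
    p_value = (1 + sum(l <= obs for l in null)) / 1000
    print("scan statistic p-value:", p_value)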

Page 36

Variable Length Window

• No need to use fixed-length window. Examine all possible windows up to say half the length of the entire road

[Figure: accidents along a road, with variable-length windows; O = fatal accident, O = non-fatal accident (distinguished by color in the original)]

Page 37

Spatial Scan Statistics• Spatial scan statistic uses, e.g., circles

instead of line segments

Page 38
Page 39

Spatial-Temporal Scan Statistics

• Spatial-temporal scan statistics use cylinders, where the height of the cylinder represents a time window

Page 40

Other Issues

• Poisson model also common (instead of the Bernoulli model)

• Covariate adjustment

• Andrew Moore’s group at CMU: efficient algorithms for scan statistics

Page 41

Software: SaTScan + others

http://www.satscan.org

http://www.phrl.org
http://www.terraseer.com

Page 42

Association Rules: Support and Confidence

• Find all the rules Y ⇒ Z with minimum confidence and support
  – support, s: probability that a transaction contains {Y & Z}
  – confidence, c: conditional probability that a transaction containing Y also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)

[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap who buy both]
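A few lines of Python verifying the slide's numbers for A ⇒ C and C ⇒ A over the four listed transactions:

    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def support(itemset):
        # fraction of transactions containing every item in the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "C"}), confidence({"A"}, {"C"}))  # 0.5, 0.666...
    print(support({"A", "C"}), confidence({"C"}, {"A"}))  # 0.5, 1.0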

Page 43

Mining Association Rules—An Example

Min. support 50%; Min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
  support = support({A & C}) = 50%
  confidence = support({A & C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent

Page 44

Mining Frequent Itemsets: the Key Step

• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset
    • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

• Use the frequent itemsets to generate association rules.

Page 45

The Apriori Algorithm

• Join Step: C_k is generated by joining L_{k−1} with itself

• Prune Step: Any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

• Pseudo-code:
    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 1; L_k ≠ ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;
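A runnable Python sketch of the pseudo-code above (join, prune, scan to count, filter by minimum support); it is written for clarity rather than speed, and the function name and structure are mine, not the slides'.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return all frequent itemsets (as frozensets) at the given support."""
        transactions = [frozenset(t) for t in transactions]
        n = len(transactions)

        def frequent(candidates):
            # scan the database; keep candidates meeting min_support
            return {c for c in candidates
                    if sum(c <= t for t in transactions) / n >= min_support}

        items = {i for t in transactions for i in t}
        L = frequent(frozenset([i]) for i in items)       # L1
        all_frequent, k = set(L), 1
        while L:
            # join step: unions of frequent k-itemsets giving (k+1)-itemsets
            candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
            # prune step: every k-subset of a candidate must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in L for s in combinations(c, k))}
            L = frequent(candidates)
            all_frequent |= L
            k += 1
        return all_frequent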

Page 46

The Apriori Algorithm — Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3: {2 3 5} → Scan D → L3:
itemset   sup
{2 3 5}   2
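Running the sketch from page 45 on this database reproduces L1, L2, and L3 (min. support 50% = 2 of 4 transactions):

    db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    for itemset in sorted(apriori(db, min_support=0.5), key=len):
        print(sorted(itemset))
    # singletons {1},{2},{3},{5}; pairs {1,3},{2,3},{2,5},{3,5}; triple {2,3,5}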

Page 47

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
  – age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]

• Single dimension vs. multiple dimensional associations (see examples above)

• Single level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?

• Various extensions (thousands!)

Page 48
Page 49

Model-based Clustering

$$f(x) = \sum_{k=1}^{K} \pi_k\, f_k(x; \theta_k)$$

[Scatterplot: “ANEMIA PATIENTS AND CONTROLS”; x-axis: Red Blood Cell Volume (3.3–4.0), y-axis: Red Blood Cell Hemoglobin Concentration (3.7–4.4)]

Padhraic Smyth, UCI
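The EM iterations shown on the next several slides fit exactly this kind of mixture. Here is a minimal NumPy sketch of EM for a K = 2 Gaussian mixture; the simulated two-cluster data stand in for the anemia measurements (which are not reproduced here), and the starting values are random.

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    rng = np.random.default_rng(4)
    # two simulated clusters standing in for controls and anemia patients
    X = np.vstack([rng.multivariate_normal([3.5, 3.9], 0.01 * np.eye(2), 150),
                   rng.multivariate_normal([3.8, 4.2], 0.01 * np.eye(2), 100)])

    K, n = 2, len(X)
    pi = np.full(K, 1 / K)
    mu = X[rng.choice(n, K, replace=False)]     # random starting means
    cov = [np.cov(X.T)] * K

    for iteration in range(25):                 # 25 iterations, as in the slides
        # E-step: responsibility of component k for each point
        R = np.column_stack([pi[k] * mvn(mu[k], cov[k]).pdf(X) for k in range(K)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted MLEs
        Nk = R.sum(axis=0)
        pi = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        cov = [((R[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / Nk[k]
               for k in range(K)]

    print("mixing weights:", pi, "\nmeans:\n", mu)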

Page 50

[Scatterplot: “ANEMIA PATIENTS AND CONTROLS”; same axes as the previous slide]

Page 51

[Scatterplot: “EM ITERATION 1”; fitted mixture components over the anemia data, same axes]

Page 52

[Scatterplot: “EM ITERATION 3”; same axes]

Page 53

[Scatterplot: “EM ITERATION 5”; same axes]

Page 54

[Scatterplot: “EM ITERATION 10”; same axes]

Page 55

[Scatterplot: “EM ITERATION 15”; same axes]

Page 56

[Scatterplot: “EM ITERATION 25”; same axes]

Page 57

Mixtures of {Sequences, Curves, …}

$$p(D_i) = \sum_{k=1}^{K} \pi_k\, p(D_i \mid c_k)$$

Generative Model

- select a component ck for individual i

- generate data according to p(Di | ck)

- p(Di | ck) can be very general

- e.g., sets of sequences, spatial patterns, etc

[Note: given p(Di | ck), we can define an EM algorithm]

Page 58

Example: Mixtures of SFSMs

Simple model for traversal on a Web site (equivalent to first-order Markov with end-state)

Generative model for large sets of Web users
- different behaviors <=> mixture of SFSMs

EM algorithm is quite simple: weighted counts
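A sketch of what “EM with weighted counts” looks like for a mixture of first-order Markov chains over page categories, in the spirit of the WebCanvas model on the next slide (though not its actual code); the number of categories, the number of clusters, and the simulated sessions are all assumptions.

    import numpy as np

    rng = np.random.default_rng(5)
    S, K = 5, 2                                  # 5 page categories, 2 clusters
    sessions = [rng.integers(0, S, size=rng.integers(3, 10)) for _ in range(200)]

    pi = np.full(K, 1 / K)
    P = rng.dirichlet(np.ones(S), size=(K, S))   # P[k, i, j] = p(j | i, cluster k)
    start = np.full((K, S), 1 / S)               # starting-page probabilities

    def log_p(session, k):
        lp = np.log(start[k, session[0]])
        for a, b in zip(session[:-1], session[1:]):
            lp += np.log(P[k, a, b])
        return lp

    for _ in range(30):
        # E-step: responsibilities R[i, k] for each session
        L = np.array([[np.log(pi[k]) + log_p(s, k) for k in range(K)]
                      for s in sessions])
        L -= L.max(axis=1, keepdims=True)
        R = np.exp(L); R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted counts of starts and transitions, then normalize
        pi = R.mean(axis=0)
        start = np.full((K, S), 1e-6)            # small smoothing constant
        counts = np.full((K, S, S), 1e-6)
        for r_i, s in zip(R, sessions):
            for k in range(K):
                start[k, s[0]] += r_i[k]
                for a, b in zip(s[:-1], s[1:]):
                    counts[k, a, b] += r_i[k]
        start /= start.sum(axis=1, keepdims=True)
        P = counts / counts.sum(axis=2, keepdims=True)

    print("cluster weights:", pi)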

Page 59

WebCanvas: Cadez, Heckerman, et al, KDD 2000

Page 60
Page 61
Page 62
Page 63

Discussion

• What is data mining? Hard to pin down – who cares?

• Textbook statistical ideas with a new focus on algorithms

• Lots of new ideas too

Page 64

Privacy and Data Mining

Ronny Kohavi, ICML 1998

Page 65

Analyzing Hospital Discharge Data

David Madigan
Rutgers University

Page 66

Comparing Outcomes Across Providers

•Florence Nightingale wrote in 1863:“In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison…I am fain to sum up with an urgent appeal for adopting some uniform system of publishing the statistical records of hospitals.”

Page 67

Data

• Data of various kinds are now available, e.g., data concerning all Medicare/Medicaid hospital admissions in standard format UB-92; covers >95% of all admissions nationally

• Considerable interest in using these data to compare providers (hospitals, physician groups, physicians, etc.)

• In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers.

Page 68

Fields (UB-92 discharge record):
SYSID, YEAR, QUARTER, PAF, HREGION, MAID, PTSEX, ETHNIC, RACE, PSEUDOID, AGE, AGECAT, PRIVZIP, MKTSHARE, COUNTY, STATE, ADTYPE, ADSOURCE, ADHOUR, ADMDX, ADDOW,
DCSTATUS, LOS, DCHOUR, DCDOW, ECODE, PDX, SDX1–SDX8, PPX, SPX1–SPX5,
PPXDOW, SPX1DOW–SPX5DOW, REFID, ATTID, OPERID, PAYTYPE1–PAYTYPE3, ESTPAYER, NAIC, OCCUR1, OCCUR2, BILLTYPE, DRGHOSP, PCMU, DRGHC4,
CANCER1, CANCER2, MDCHC4, MQSEV, MQNRSP, PROFCHG, TOTALCHG, NONCVCHG, ROOMCHG, ANCLRCHG, DRUGCHG, EQUIPCHG, SPECLCHG, MISCCHG, APRMDC, APRDRG, APRSOI, APRROM, MQGCLUST, MQGCELL

Pennsylvania Health Care Cost Containment Council, 2000–1, n = 800,000

Page 69

Risk Adjustment

• Discharge data like these allow for comparisons of, e.g., mortality rates for the CABG procedure across hospitals.

• Some hospitals accept riskier patients than others; a fair comparison must account for such differences.

• PHC4 (and many other organizations) use “indirect standardization”

• http://www.phc4.org

Page 70
Page 71

Hospital Responses

Page 72
Page 73

p-value computation

n = 463; suppose the actual number of deaths = 40; e = 29.56

$$\text{p-value} = \sum_{j=40}^{463}\binom{463}{j}\left(\frac{29.56}{463}\right)^{j}\left(1-\frac{29.56}{463}\right)^{463-j} = 0.02$$

p-value < 0.05
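The binomial tail above is one line in scipy (a quick check; the exact value depends on whether the sum starts at the observed count, which the garbled original leaves slightly ambiguous):

    from scipy import stats

    n, e, deaths = 463, 29.56, 40
    p_value = stats.binom.sf(deaths - 1, n, e / n)   # P(X >= 40)
    print(round(p_value, 3))                          # compare with the slide's 0.02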

Page 74

Concerns

• Ad-hoc groupings of strata

• Adequate risk adjustment for outcomes other than mortality? Sensitivity analysis? Hopeless?

• Statistical testing versus estimation

• Simpson’s paradox

Page 75

Hospital A:
Risk Cat.   N     Rate   Actual Number   Expected Number
Low         800   1%     8               8 (1%)
High        200   8%     16              10 (5%)

SMR = 24/18 = 1.33; p-value = 0.07

Hospital B:
Risk Cat.   N     Rate   Actual Number   Expected Number
Low         200   1%     2               2 (1%)
High        800   8%     64              40 (5%)

SMR = 66/42 = 1.57; p-value = 0.0002
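A few lines reproducing the two SMRs: the stratum-specific death rates (1% low risk, 8% high risk) and expected rates (1%, 5%) are identical for A and B, so the different SMRs come entirely from case mix; this is the Simpson's-paradox concern from the previous slide.

    def smr(n_low, n_high, rate=(0.01, 0.08), expected=(0.01, 0.05)):
        # actual and indirectly standardized expected deaths
        actual = n_low * rate[0] + n_high * rate[1]
        exp = n_low * expected[0] + n_high * expected[1]
        return actual, exp, actual / exp

    print("Hospital A: %d/%d = %.2f" % smr(800, 200))   # 24/18 = 1.33
    print("Hospital B: %d/%d = %.2f" % smr(200, 800))   # 66/42 = 1.57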

Page 76

Hierarchical Model

• Patients -> physicians -> hospitals

• Build a model using data at each level and estimate quantities of interest

Page 77

Bayesian Hierarchical Model

MCMC via WinBUGS
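As a stand-in for the WinBUGS model (which the slides do not show), here is a toy Gibbs sampler for a normal-normal hierarchy that shrinks hospital-level effects toward a common mean; all data values and priors are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(6)
    y = np.array([1.1, 0.4, 2.3, 1.6, 0.2, 1.9])   # hospital log-SMRs (made up)
    se2 = np.full(len(y), 0.25)                    # known sampling variances

    mu, tau2 = 0.0, 1.0
    for sweep in range(2000):
        # hospital effects theta_i | mu, tau2, y (conjugate normal update)
        prec = 1 / se2 + 1 / tau2
        theta = rng.normal((y / se2 + mu / tau2) / prec, np.sqrt(1 / prec))
        # population mean mu | theta, tau2 (flat prior)
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / len(y)))
        # between-hospital variance tau2 | theta, mu (inverse-gamma(1,1) prior)
        a = 1 + len(y) / 2
        b = 1 + 0.5 * np.sum((theta - mu) ** 2)
        tau2 = b / rng.gamma(a)

    print("one posterior draw of the hospital effects:", np.round(theta, 2))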

Page 78

Goldstein and Spiegelhalter, 1996

Page 79

Discussion

• Markov chain Monte Carlo + compute power enable hierarchical modeling

• Software is a significant barrier to the widespread application of better methodology

• Are these data useful for the study of disease?