Multifactor Dimensionality Reduction Laura Mustavich Introduction to Data Mining Final Project Presentation April 26, 2007


Slide 1

Multifactor Dimensionality Reduction
Laura Mustavich
Introduction to Data Mining
Final Project Presentation
April 26, 2007

Slide 2

The Inspiration For a Method

Slide 3

The Nature of Complex Diseases

Most common diseases are complex:
• Caused by multiple genes
• Often interacting with one another

This interaction is termed Epistasis

Slide 4

Epistasis

When an allele at one locus masks the effect of an allele at another locus

Slide 5

The Failure of Traditional Methods

Traditional gene-hunting methods are successful for rare Mendelian (single-gene) diseases

They are unsuccessful for complex diseases:
• Since many genes interact to cause the disease, the effect of any single gene is too small to detect
• They do not take this interaction into account

Slide 6

MDR: The Algorithm

Slide 7

Multifactor Dimensionality Reduction
• A data mining approach to identify interactions among discrete variables that influence a binary outcome
• A nonparametric alternative to traditional statistical methods such as logistic regression
• Driven by the need to improve the power to detect gene-gene interactions

Slide 8

Multifactor Dimensionality Reduction

Slide 9

MDR Step 0

Divide data (genotypes, discrete environmental factors, and affectation status) into 10 distinct subsets
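The MDR software performs this split internally; as a rough illustration only (not the authors' code), a 10-fold split might look like the following Python sketch, assuming the data sit in a pandas DataFrame with one row per individual and a binary "status" column (1 = affected):

```python
# Rough sketch of MDR step 0 (illustrative, not the MDR software itself):
# split the data into 10 cross-validation subsets.
import numpy as np
import pandas as pd

def make_cv_subsets(data: pd.DataFrame, n_folds: int = 10, seed: int = 0):
    """Return a list of (training, testing) DataFrame pairs."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))        # shuffle the row indices once
    folds = np.array_split(idx, n_folds)    # 10 roughly equal subsets
    return [
        (data.iloc[np.concatenate(folds[:i] + folds[i + 1:])],  # 9/10 for training
         data.iloc[folds[i]])                                   # 1/10 for testing
        for i in range(n_folds)
    ]
```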

Slide 10

Multifactor Dimensionality Reduction

Slide 11

MDR Step 1

Select a set of n genetic or environmental factors (suspected of interacting epistatically with one another) from the set of all variables in the training set
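For illustration, enumerating the candidate factor sets is simply a combinations problem; the marker names below are hypothetical placeholders:

```python
# Step 1 as code: each candidate model is a combination of n factors.
from itertools import combinations

factors = ["TAS2R16_11431", "PTC_8876467", "ADH1C_2688508", "ADH1C_rs3762896"]
n = 2
candidate_models = list(combinations(factors, n))
# -> ("TAS2R16_11431", "PTC_8876467"), ("TAS2R16_11431", "ADH1C_2688508"), ...
```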

Slide 12

Multifactor Dimensionality Reduction

Slide 13

MDR Step 2

Create a contingency table for these multilocus genotypes, counting the number of affected and unaffected individuals with each multilocus genotype
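A minimal sketch of this counting step, continuing the hypothetical DataFrame from step 0 (one row per individual, "status" = 1 for cases):

```python
# Step 2: count affected and unaffected individuals for every multilocus
# genotype observed in the training set (train is a DataFrame as in step 0).
def genotype_counts(train, factor_set):
    # Result: one row per multilocus genotype, columns 0 (controls) and 1 (cases).
    return (train.groupby(list(factor_set))["status"]
                 .value_counts()
                 .unstack(fill_value=0))
```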

Slide 14

Multifactor Dimensionality Reduction

Slide 15

MDR Step 3

Calculate the ratio of cases to controls for each multilocus genotype

Slide 16

Multifactor Dimensionality Reduction

Slide 17

MDR Step 4

Label each multilocus genotype as “high-risk” or “low-risk”, depending on whether the case-control ratio is above a certain threshold

This is the dimensionality reduction step: it reduces the n-dimensional space to 1 dimension with 2 levels
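A sketch of steps 3 and 4 combined, taking the counts table from step 2. The threshold is an input (a common choice is the overall case:control ratio of the sample, e.g. 1.0 for balanced data), and the comparison is written to avoid dividing by zero when a genotype cell has no controls:

```python
# Steps 3-4: case:control ratio per multilocus genotype, then a binary
# high-risk / low-risk label. "cases > threshold * controls" is equivalent to
# cases/controls > threshold but also handles cells with zero controls.
# Assumes the training data contain both cases and controls.
def label_genotypes(counts, threshold=1.0):
    # counts: DataFrame from step 2 with columns 0 (controls) and 1 (cases)
    return (counts[1] > threshold * counts[0]).rename("high_risk")
```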

Slide 18

Multifactor Dimensionality Reduction

Slide 19

MDR Step 5

Use labels to classify individuals as cases or controls, and calculate the misclassification rate
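Continuing the sketch, classification is a lookup of each individual's multilocus genotype in the high/low-risk map; genotypes never seen in training are treated as low-risk here, which is one possible convention rather than the software's exact rule:

```python
# Step 5: predict with the high/low-risk labels and measure misclassification.
# Assumes n >= 2 factors, so the group keys from step 2 are genotype tuples.
def misclassification_rate(df, factor_set, high_risk):
    risk_map = high_risk.to_dict()              # {genotype tuple: True/False}
    genotypes = df[list(factor_set)].itertuples(index=False, name=None)
    predicted = [bool(risk_map.get(g, False)) for g in genotypes]
    actual = (df["status"] == 1).tolist()
    errors = sum(p != a for p, a in zip(predicted, actual))
    return errors / len(df)
```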

Slide 20

Multifactor Dimensionality Reduction

Slide 21

Repeat steps 1-5 for:
• All possible combinations of n factors
• All possible values of n
• Across all 10 training and testing sets
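Putting the pieces together, this exhaustive search can be sketched as three nested loops over order, factor combination, and cross-validation fold, reusing the toy helpers from the previous steps (real MDR implementations prune and optimize this heavily):

```python
# The full search: every order n, every combination of that order, every fold.
from itertools import combinations

def run_mdr(data, factors, max_order, threshold=1.0):
    results = {}   # (model, fold) -> (training error, testing error)
    for n in range(1, max_order + 1):
        for model in combinations(factors, n):
            for fold, (train, test) in enumerate(make_cv_subsets(data)):
                high_risk = label_genotypes(genotype_counts(train, model), threshold)
                results[(model, fold)] = (
                    misclassification_rate(train, model, high_risk),
                    misclassification_rate(test, model, high_risk),
                )
    return results
```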

Slide 22

The Best Model

Minimizes prediction error: the average misclassification rate across the 10 cross-validation subsets

Maximizes cross-validation consistency: the number of times a particular model was the best model across the cross-validation subsets
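As an illustration of how these two criteria could be combined (a sketch of the idea, not the exact tie-breaking rule used by the MDR software):

```python
# Rank models by average testing error (prediction error) and by how often a
# model wins its fold (cross-validation consistency).
from collections import Counter, defaultdict

def pick_best_model(results, n_folds=10):
    test_errors = defaultdict(list)
    for (model, fold), (_, test_err) in results.items():
        test_errors[model].append(test_err)
    prediction_error = {m: sum(e) / len(e) for m, e in test_errors.items()}

    cv_consistency = Counter()                   # per-fold winners
    for fold in range(n_folds):
        _, winner = min((err, model)
                        for (model, f), (_, err) in results.items() if f == fold)
        cv_consistency[winner] += 1

    # lowest prediction error first, ties broken by higher CV consistency
    best = min(prediction_error,
               key=lambda m: (prediction_error[m], -cv_consistency[m]))
    return best, prediction_error[best], cv_consistency[best]
```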

Slide 23

Hypothesis test of best model:

Evaluate the magnitude of the cross-validation consistency and prediction error estimates by permutation testing:
• Randomize the disease labels
• Repeat the MDR analysis several times to get distributions of cross-validation consistencies and prediction errors
• Use these distributions to determine p-values for your actual cross-validation consistency and prediction error
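A sketch of that permutation procedure, reusing the toy run_mdr and pick_best_model functions above; stat_fn is a hypothetical hook that extracts whichever statistic is being tested (e.g. the best model's cross-validation consistency; flip the comparison for prediction error, where smaller values are the extreme ones):

```python
# Permutation testing sketch: shuffle the affection status, rerun the whole
# MDR search, and collect the null distribution of the chosen statistic.
import numpy as np

def permutation_pvalue(observed, data, factors, max_order, stat_fn,
                       n_perm=1000, threshold=1.0, seed=0):
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perm):
        shuffled = data.copy()
        shuffled["status"] = rng.permutation(shuffled["status"].to_numpy())
        null.append(stat_fn(run_mdr(shuffled, factors, max_order, threshold)))
    # one-sided p-value: how often chance alone does at least as well
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)
```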

Slide 24

Permutation Testing: An illustration

Sample Quantiles: An Example Empirical Distribution

[Histogram of the empirical permutation distribution; x-axis: statistic value (0.2 to 1.0), y-axis: frequency (0 to 10)]

Quantile   Value
0%         0.045754
25%        0.168814
50%        0.237763
75%        0.321027
90%        0.423336
95%        0.489813
99%        0.623899
99.99%     0.872345
100%       1

Observed value: 0.4500

The probability that we would see results as, or more, extreme than 0.4500, simply by chance, is between 5% and 10%

Slide 25

Strengths

• Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multilocus data
• Non-parametric – no values are estimated
• Assumes no particular genetic model
• False positives due to multiple testing are minimized

Slide 26

Weaknesses

• Computationally intensive (especially with >10 loci)
• The curse of dimensionality: decreased predictive ability with high dimensionality and a small sample, due to cells with no data

Slide 27

MDR Software

Slide 28

The Authors

Hahn LW, Ritchie MD, Moore JH (2003). Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 19(3): 376-382.

www.sourceforge.net

Slide 29

Slide 30

Values Calculated by MDR

Measure             Formula / Interpretation
Balanced Accuracy   (Sensitivity + Specificity)/2; the fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class
Accuracy            (TP+TN)/(TP+TN+FP+FN); proportion of instances correctly classified
Sensitivity         TP/(TP+FN); proportion of actual positives correctly classified
Specificity         TN/(TN+FP); proportion of actual negatives correctly classified
Odds Ratio          (TP*TN)/(FP*FN); compares whether the probability of a certain event is the same for two groups
χ²                  Chi-squared score for the attribute constructed by MDR from this attribute combination
Precision           TP/(TP+FP); proportion of predicted positives that are actual positives
Kappa               2(TP*TN − FP*FN)/[(TP+FN)(FN+TN)+(TP+FP)(FP+TN)]; a function of total accuracy and random accuracy
F-Measure           2*TP/(2*TP+FP+FN); a function of sensitivity and precision
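For reference, a sketch of these quantities as plain functions of the confusion-matrix counts obtained by comparing MDR's high/low-risk predictions with true case/control status (the χ² score is omitted, since it comes from a separate contingency test):

```python
# The table's formulas written out in terms of TP, TN, FP, FN.
def mdr_metrics(TP, TN, FP, FN):
    sens = TP / (TP + FN)
    spec = TN / (TN + FP)
    return {
        "balanced_accuracy": (sens + spec) / 2,
        "accuracy": (TP + TN) / (TP + TN + FP + FN),
        "sensitivity": sens,
        "specificity": spec,
        # infinite when FP or FN is zero, as in the results tables below
        "odds_ratio": (TP * TN) / (FP * FN) if FP * FN else float("inf"),
        "precision": TP / (TP + FP),
        "kappa": 2 * (TP * TN - FP * FN)
                 / ((TP + FN) * (FN + TN) + (TP + FP) * (FP + TN)),
        "f_measure": 2 * TP / (2 * TP + FP + FN),
    }
```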

Slide 31

Sign Test

n = number of cross-validation intervals

c = number of cross-validation intervals with testing accuracy ≥ 0.5

The probability of observing c or more cross-validation intervals with testing accuracy ≥ 0.5 if each case were actually classified randomly:

p = Σ_{k=c}^{n} C(n, k) · (1/2)^n
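A quick check of the formula as code; with n = 10 folds it reproduces the sign-test p-values reported in the results tables below (10 of 10 intervals gives p ≈ 0.0010, 8 of 10 gives p ≈ 0.0547):

```python
# Sign-test p-value: chance of c or more of n intervals with accuracy >= 0.5
# under random classification (a fair coin per interval).
from math import comb

def sign_test_pvalue(c: int, n: int = 10) -> float:
    return sum(comb(n, k) for k in range(c, n + 1)) / 2 ** n

print(round(sign_test_pvalue(10), 4))  # 0.001
print(round(sign_test_pvalue(8), 4))   # 0.0547
```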

Slide 32

The Problem of Alcoholism

A Case Study

Slide 33

Genes Associated With Alcoholism

[Pathway diagram: Alcohol --(ADH enzymes)--> Acetaldehyde --(ALDH2 enzyme)--> Acetate]

The ADH (alcohol dehydrogenase) and ALDH2 (acetaldehyde dehydrogenase 2) genes are involved in alcohol metabolism and are associated with alcoholism

Slide 34

ADH Genes

[Gene map: the ~370 kb ADH cluster on chromosome 4 (shown 5' to 3'), containing ADH7, ADH6, ADH4, ADH5, ADH1B, ADH1A, and ADH1C, spanning gene classes I-V]

Slide 35

Taste Receptors and Aversion to Alcohol

PTC (TAS2R38)
• Tasters: alcohol tastes bitter, so they drink less alcohol
• Non-tasters: alcohol tastes sweet, so they drink more alcohol

• a person must be willing to drink in order to be an alcoholic

• TAS2R38 affects the amount of alcohol a person is willing to drink

• therefore, it is related to alcoholism, although no direct association has been found

• we hope to provide a direct link between TAS2R38 and alcoholism, by demonstrating that it acts epistatically with other genes associated with alcoholism

Slide 36

Actual Analysis

Slide 37

Data

A sample of cases and controls (alcoholics and non-alcoholics) from three East Asian populations: the Ami, Atayal, and Taiwanese

Genotyped for 98 markers within several genes: ALDH2, all ADH genes, and 2 taste receptor genes, TAS2R16 and TAS2R38 (PTC)

Slide 38

Computational Limitations

1. The software package has a problem reading missing data

I was forced to use only complete records, which reduced my (already small) sample to 79 individuals

Slide 39

Computational Limitations

2. The computation time is prohibitively long for higher-order models, especially with large numbers of attributes

I was advised to restrict my attributes to markers within ADH1C and the 2 taste receptor genes, which left me with 36 attributes

I considered models only up to order 4

Slide 40

Summary of Results: All Populations

Order 1
Model: X.04..ADH1C.dwstrm.Te
Training Bal. Acc.: 0.6049   Testing Bal. Acc.: 0.4278   Sign Test: 0 (p = 1.0000)   CV Consistency: 5/10

Order 2
Model: X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te
Training Bal. Acc.: 0.7076   Testing Bal. Acc.: 0.4438   Sign Test: 3 (p = 0.9453)   CV Consistency: 6/10

Order 3
Model: X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te, X.04..ADH1C.rs3762896
Training Bal. Acc.: 0.785   Testing Bal. Acc.: 0.3186   Sign Test: 1 (p = 0.9990)   CV Consistency: 4/10

Order 4
Model: X.07..TAS2R16.C_11431, X.07..PTC.C_8876291_1, X.07..PTC.C_8876482_1, X.04..ADH1C.dwstrm.Te
Training Bal. Acc.: 0.8453   Testing Bal. Acc.: 0.3564   Sign Test: 2 (p = 0.9893)   CV Consistency: 6/10

Instances: 79   Attributes: 36   Ratio: 1.3235

Slide 41

Summary of Results: Ami

Order 1
Model: X.07..TAS2R16.C_11431
Training Bal. Acc.: 0.7331   Testing Bal. Acc.: 0.4598   Sign Test: 5 (p = 0.6230)   CV Consistency: 5/10

Order 2
Model: X.07..TAS2R16.C_11431, X.04..ADH1C.C_2688508
Training Bal. Acc.: 0.8284   Testing Bal. Acc.: 0.3476   Sign Test: 2 (p = 0.9893)   CV Consistency: 3/10

Order 3
Model: X.07..TAS2R16.C_11431, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508
Training Bal. Acc.: 0.9688   Testing Bal. Acc.: 0.9545   Sign Test: 10 (p = 0.0010)   CV Consistency: 10/10

Order 4
Model: X.07..TAS2R16.C_11431, X.07..TAS2R16.C_11431.1, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508
Training Bal. Acc.: 0.9722   Testing Bal. Acc.: 0.8712   Sign Test: 8 (p = 0.0547)   CV Consistency: 9/10

Instances: 30   Attributes: 36   Ratio: 0.8750

Slide 42

Cross Validation Statistics

Measure             Training               Testing
Balanced Accuracy   0.9688                 0.9545
Accuracy            0.9667                 0.95
Sensitivity         1                      1
Specificity         0.9375                 0.9091
Odds Ratio          ∞                      ∞
χ²                  23.6250 (p < 0.0001)   1.6364 (p = 0.2008)
Precision           0.9333                 0.9
Kappa               0.9333                 0.9
F-Measure           0.9655                 0.9474

Sign Test: 10 (p = 0.0010)
Cross-validation Consistency: 10/10

Slide 43

Whole Dataset Statistics:

Training Balanced Accuracy: 0.9688
Training Accuracy: 0.9667
Training Sensitivity: 1.0000
Training Specificity: 0.9375
Training Odds Ratio: ∞
Training χ²: 26.2500 (p < 0.0001)
Training Precision: 0.9333
Training Kappa: 0.9333
Training F-Measure: 0.9655

Slide 44

Graphical Model

Slide 45

Classification Rules

(Each row reads: IF the genotypes at the three markers match THEN predict the given class.)

X.07..TAS2R16.C_11431   X.07..PTC.C_8876467_1   X.04..ADH1C.C_2688508   Class
A\A                     C\G                     C\C                     0
A\A                     C\G                     C\T                     1
A\A                     C\G                     T\T                     0
A\A                     G\G                     C\C                     0
A\A                     G\G                     C\T                     0
A\A                     G\G                     T\T                     1
A\G                     C\C                     C\T                     1
A\G                     C\G                     C\C                     0
A\G                     C\G                     C\T                     0
A\G                     C\G                     T\T                     0
A\G                     G\G                     C\C                     1
A\G                     G\G                     C\T                     1
A\G                     G\G                     T\T                     0
G\G                     C\G                     C\T                     1
G\G                     G\G                     C\C                     0
G\G                     G\G                     C\T                     1
G\G                     G\G                     T\T                     1

Slide 46

Locus Dendrogram

Slide 47

Future Work

Simulations to calculate the power of MDR, especially in relation to sample size

Comparison of MDR with logistic regression, and other proposed methods to detect epistasis, with respect to the current data set and simulated data

Research how different methods of searching the model space can be incorporated into the MDR implementation to improve computational feasibility