1
High Throughput Target Identification
Stan Young, NISS
Doug Hawkins, U Minnesota
Christophe Lambert, Golden Helix
Machine Learning, Statistics, and Discovery
25 June 03
2
Publication Year    All Journals    PNAS
1992                  0               0
1993                  0               0
1994                  0               0
1995                  4               0
1996                  3               1
1997                  8               2
1998                 37               1
1999                134               8
2000                409              34
2001                773              46
Microarray Literature
3
Guilt by Association :
You are known
by the company you keep.
4
Data Matrix
Goal: Associations over the genes.
Guilty Gene
Genes
Tissues
5
Goals
1. Associations.
2. Deep associations – beyond 1st level correlations.
3. Uncover multiple mechanisms.
6
Problems
1. n << p
2. Strong correlations.
3. Missing values.
4. Non-normal distributions.
5. Outliers.
6. Multiple testing.
7
Technical Approach
1. Recursive partitioning.
2. Resampling-based, adjusted p-values.
3. Multiple trees.
8
Recursive Partitioning
Tasks
1. Create classes.
2. How to split.
3. How to stop.
9
Differences:
Recursive Partitioning
• Top-down analysis.
• Can use any type of descriptor.
• Uses biological activities to determine which features matter.
• Produces a classification tree for interpretation and prediction.
• Big N is not a problem!
• Missing values are ok.
• Multiple trees, big p is ok.
Clustering
• Often bottom-up.
• Uses “gestalt” matching.
• Requires an external method for determining the right feature set.
• Difficult to interpret or use for prediction.
• Big N is a severe problem!!
10
Forming Classes, Categories, Groups
Profession          Av. Income
Baseball Players    1.5M
Football Players    1.2M
Doctors             .8M
Dentists            .5M
Lawyers             .23M
Professors          .09M
. . . . .
11
Forming Classes from “Continuous” Descriptor
[Figure: number line for the descriptor, with tick marks from -3 to 6]
How many “cuts” and where to make them?
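One way to make the "where to cut" question concrete is sketched below. It assumes, as a common convention the slides do not confirm, that candidate cuts are the midpoints between consecutive distinct values, and that a set of cuts turns a continuous descriptor into ordered classes. The function names `candidate_cuts` and `segment` are hypothetical.

```python
import bisect

def candidate_cuts(values):
    """Midpoints between consecutive distinct values (assumed convention)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def segment(values, cuts):
    """Map each value to a class index: the number of cuts at or below it."""
    ordered_cuts = sorted(cuts)
    return [bisect.bisect_right(ordered_cuts, v) for v in values]

# Illustrative values only, not data from the talk.
cuts = candidate_cuts([3, 1, 2, 2])
classes = segment([-2.5, 0.1, 4.0], [-1.0, 2.0])
```

With k cuts the descriptor falls into k + 1 classes; each candidate segmentation can then be scored with a t-test (two classes) or an F-test (three or more).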
12
Splitting : t-test
Parent node: n = 1650, ave = 0.34, sd = 0.81
Split into:
  n = 1614, ave = 0.29, sd = 0.73
  n = 36, ave = 2.60, sd = 0.9

\[
t = \frac{\text{Signal}}{\text{Noise}}
  = \frac{2.60 - 0.29}{0.734\,\sqrt{\frac{1}{36} + \frac{1}{1614}}}
  = 18.68
\]

TT: NN-CC
rP = 2.03E-70
aP = 1.30E-66
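The t statistic on this slide can be reproduced from the node summaries alone. The sketch below assumes the noise term is the pooled standard deviation (which works out to about 0.734 for these nodes) times the usual two-sample factor; `pooled_t` is a hypothetical helper name.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t statistic with a pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    noise = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / noise

# Node summaries taken from the slide.
t = pooled_t(36, 2.60, 0.9, 1614, 0.29, 0.73)
print(f"t = {t:.2f}")  # close to the 18.68 reported on the slide
```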
13
Splitting : F-test
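Parent node: n = 1650, ave = 0.34, sd = 0.81
Split into:
  n = 1553, ave = 0.21, sd = 0.73
  n = 36, ave = 2.60, sd = 0.9
  n = 61, ave = 1.29, sd = 0.83

\[
F = \frac{\text{Signal}}{\text{Noise}}
  = \frac{\text{Among Var}}{\text{Within Var}}
  = \frac{\sum_i \sum_j (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 / df_1}
         {\sum_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2 / df_2}
\]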
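For a split into more than two classes, the among/within ratio can be computed directly from the observations in each class. This is a minimal sketch of the standard one-way ANOVA F statistic, with df1 = (number of classes - 1) and df2 = (N - number of classes); the function name and the toy data are illustrative, not from the talk.

```python
def f_statistic(groups):
    """One-way ANOVA F: among-class variance over within-class variance."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Among-class sum of squares, weighted by class size.
    ss_among = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-class sum of squares around each class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df1 = len(groups) - 1
    df2 = n_total - len(groups)
    return (ss_among / df1) / (ss_within / df2)

# Toy data: three classes of three observations each.
f_value = f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```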
14
How to Stop
Examine each current terminal node.
Stop if no variable/class has a
significant split, multiplicity adjusted.
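The stopping rule can be sketched as a check at each terminal node. The example below uses a Bonferroni correction over the candidate predictors purely as a simple stand-in; the talk's approach uses resampling-based adjusted p-values. The function name `choose_split` and the p-values are hypothetical.

```python
def choose_split(raw_p_values, alpha=0.05):
    """Return (predictor, adjusted_p) for the best split, or None to stop.

    raw_p_values maps predictor name -> raw p-value of its best split.
    Bonferroni over the candidates is an assumed stand-in adjustment.
    """
    if not raw_p_values:
        return None
    n_candidates = len(raw_p_values)
    best = min(raw_p_values, key=raw_p_values.get)
    adjusted = min(1.0, raw_p_values[best] * n_candidates)
    return (best, adjusted) if adjusted < alpha else None

split = choose_split({"G518": 1e-6, "G790": 0.2})   # significant: keep splitting
no_split = choose_split({"G518": 0.03, "G790": 0.5})  # 0.03 * 2 = 0.06: stop
```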
15
Levels of Multiple Testing
1. Raw p-value.
2. Adjust for class formation, segmentation.
3. Adjust for multiple predictors.
4. Adjust for multiple splits in the tree.
5. Adjust for multiple trees.
16
Understanding observations
NB: Splitting variables govern the process, linked to response variable.
Multiple Mechanisms
Conditionally important descriptors.
17
Multiple Mechanisms
18
Reality: Example Data
60 Tissues
1453 Genes
Gene 510 is the “guilty” gene, the Y.
19
1st Split of Gene 510 (Guilty Gene)
20
Split Selection
14 splitters
with adjusted
p-value
< 0.05
21
Histogram
Non-normal, hence
resampling p-values
make sense.
22
Resampling-based Adjusted p-value
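A resampling-based adjustment can be sketched as a single-step max-statistic permutation test: permute the response, record the largest statistic over all predictors, and compare each observed statistic to that null distribution. This is an assumed illustration using absolute correlation as the statistic; it is not necessarily the exact procedure used in the talk, and all names are hypothetical.

```python
import math
import random

def abs_corr(x, y):
    """Absolute Pearson correlation (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)

def max_stat_adjusted_p(columns, y, n_perm=1000, seed=0):
    """Single-step max-statistic adjusted p-value for each predictor."""
    rng = random.Random(seed)
    observed = {name: abs_corr(col, y) for name, col in columns.items()}
    exceed = {name: 0 for name in columns}
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between predictors and response
        max_stat = max(abs_corr(col, y_perm) for col in columns.values())
        for name in columns:
            if max_stat >= observed[name]:
                exceed[name] += 1
    # Add-one correction keeps every adjusted p-value strictly positive.
    return {name: (exceed[name] + 1) / (n_perm + 1) for name in columns}

# Toy data: one predictor tracks y exactly, the other is a scrambled sequence.
y = [float(i) for i in range(20)]
columns = {"signal": list(y), "noise": [float((7 * i) % 20) for i in range(20)]}
adjusted = max_stat_adjusted_p(columns, y, n_perm=200)
```

Because the null distribution comes from the data themselves, no normality assumption is needed, and correlation among predictors is carried through the permutations.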
23
Single Tree RP Drawbacks
• Data greedy.
• Only one view of the data. May miss other mechanisms.
• Highly correlated variables may be obscured.
• Higher order interactions may be masked.
• No formal mechanisms for follow-up experimental design.
• Disposition of outliers is difficult.
24
Etc.
Multiple Trees, how and why?
25
How do you get multiple trees?
1. Bootstrap the sample, one tree per sample.
2. Randomize over valid splitters.
Etc.
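The two ways of generating multiple trees listed above can be sketched as follows: draw a bootstrap sample of the tissues for each tree, or, at each node, pick at random among the splitters that pass the significance threshold instead of always taking the best one. The helper names and p-values are hypothetical.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Sample row indices with replacement, one bootstrap sample per tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def pick_random_valid_splitter(adjusted_p, rng, alpha=0.05):
    """Choose uniformly among splitters whose adjusted p-value is below alpha."""
    valid = sorted(name for name, p in adjusted_p.items() if p < alpha)
    return rng.choice(valid) if valid else None

rng = random.Random(1)
rows = bootstrap_indices(60, rng)  # 60 tissues, as in the example data
splitter = pick_random_valid_splitter(
    {"G518": 0.001, "G790": 0.01, "G12": 0.4}, rng
)
```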
26
Random Tree Browsing,
1000 Trees.
27
Example Tree
28
1st Split
29
Example Tree, 2nd Split
30
Conclusion for Gene G510
If G518 < -0.56
and
G790 < -1.46
then
G510 = 1.10 +/- 0.30
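The rule read off this path of the tree can be written as a simple predicate on a tissue's expression values. The thresholds and the node mean come from the slide; the dictionary layout and function names are assumptions, and the slide does not say whether 0.30 is a standard deviation or a standard error.

```python
NODE_MEAN_G510 = 1.10  # mean of G510 in this terminal node, from the slide

def in_terminal_node(tissue):
    """True if the tissue satisfies both split conditions on the path."""
    return tissue["G518"] < -0.56 and tissue["G790"] < -1.46

def predict_g510(tissue):
    """Predicted G510 for tissues in this node; None for other nodes."""
    return NODE_MEAN_G510 if in_terminal_node(tissue) else None

# Hypothetical tissue, not from the data set.
prediction = predict_g510({"G518": -0.9, "G790": -2.0})
```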
31
Using Multiple Trees to Understand variables
• Which variables matter?
• How to rank variables in importance.
• Correlations.
• Synergistic variables.
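One simple way to address these questions with many trees is to count how often each variable is used as a splitter and how often pairs of variables appear in the same tree. Treating usage frequency as importance and co-occurrence as a hint of synergy is an assumption of this sketch, not a method confirmed by the slides; the tree contents are invented.

```python
from collections import Counter
from itertools import combinations

def rank_variables(trees):
    """Rank splitters by the number of trees in which they appear."""
    counts = Counter(name for tree in trees for name in set(tree))
    return counts.most_common()

def co_occurrence(trees):
    """Count how often each pair of splitters appears in the same tree."""
    pairs = Counter()
    for tree in trees:
        for a, b in combinations(sorted(set(tree)), 2):
            pairs[(a, b)] += 1
    return pairs

# Each tree is represented only by the names of its splitting variables.
trees = [["G518", "G790"], ["G518", "G12"], ["G518", "G790"]]
ranking = rank_variables(trees)
pairs = co_occurrence(trees)
```

Variables that rarely appear together but are each used often may be correlated substitutes for one another, while pairs that repeatedly appear in the same tree are candidates for predictors that work together.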
32
Correlation Interaction Matrix
Red=Syn.
33
Summary
• Review recursive partitioning.
• Demonstrated multiple tree RP’s capabilities
– Find associated genes
– Group correlated predictors (genes)
– Synergistic predictors (genes that predict together)
• Used to understand a complex data set.
34
Needed research
• Real data sets with known answers.
• Benchmarking.
• Linking to gene annotations.
• Scale (1,000 × 10,000).
• Multiple testing in complex data sets.
• Good visualization methods.
• Outlier detection for large data sets.
• Missing values. (see NISS paper 123)
35
Teams
NC State University: Jacqueline Hughes-Oliver, Katja Rimlinger
U Waterloo: Will Welch, Hugh Chipman, Marcia Wang, Yan Yuan
U. Minnesota: Douglas Hawkins
NISS: Alan Karr (Consider post docs)
GSK: Lei Zhu, Ray Lam
36
References/Contact
1. www.goldenhelix.com.
2. www.recursive-partitioning.com.
3. www.niss.org, papers 122 and 123.
4. GSK patent.
37
Questions