X 00010010110000101011000111
R2ECCR @ NCSU
________________________________________________
This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 P20 HG003900-01. Information on the Molecular Libraries Roadmap Initiative can be obtained from http://nihroadmap.nih.gov/molecularlibraries/
Jacqueline M. Hughes-Oliver
Department of Statistics
North Carolina State University
*joint with Ke Zhang, GSK and Stan Young, NISS
Analysis of High-Dimensional Structure-Activity Screening
Datasets Using the Optimal Bit String Tree
Blackwell-Tapia - November 2008
Outline
• Background
• Recursive partitioning
• OBSTree
• Simulation study
• Screening for monoamine oxidase inhibitors
• Summary
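Background
Estimate a function $f$ such that
$f: \{0,1\}^p \to \{0, 1, \ldots, M\}$
based on
$(Y_i, X_i)$, $Y_i \in \{0, 1, \ldots, M\}$, $X_i \in \{0,1\}^p$, $i = 1, \ldots, n$
where
$f = (f_1, \ldots, f_l, \ldots, f_L)$, $f_l: \{0,1\}^{s_l} \to \{0, 1, \ldots, M\}$, so each $f_l$ uses $s_l$ of the $p$ descriptors.
Preferably, the two misclassifications $\{\hat f = 0 \mid Y \neq 0\}$ and $\{\hat f \neq 0 \mid Y = 0\}$ carry unequal costs: one costs more than the other.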
http://pubchem.ncbi.nlm.nih.gov/
http://eccr.stat.ncsu.edu/
http://www.niss.org/PowerMV/
Background – Structure-Activity Relationship (SAR)
[Figure: chemical structure labeled "AD:Tricyclic:Amitriptyline"]
• Willett, Barnard, Downs (1998 JCICS)
• Molecular descriptors: Carhart atom pairs
– Atom type, distance, atom type, e.g., C(2,1)-04-C(3,1)
– Binary descriptors: few turned on
Recursive Partitioning
[Figure: tree with splits X3=1 and X27=1, each with True/False branches]
• Splitting variable chosen to optimize "purity measure"
• Search space: size p
Need definitions for:
• search space
• purity measure, splitting criterion
• stopping criterion
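The search over p single descriptors can be sketched as follows. This is a minimal illustration, assuming entropy as the purity measure and a weighted average of child entropies as the splitting criterion; the function names and the toy data are hypothetical and not taken from the talk.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class labels in a node (0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def best_single_split(X, y):
    """Return (descriptor index, weighted child entropy) for the best split X[j] == 1."""
    n, p = len(X), len(X[0])
    best_j, best_score = None, float("inf")
    for j in range(p):  # search space of size p
        left = [y[i] for i in range(n) if X[i][j] == 1]
        right = [y[i] for i in range(n) if X[i][j] == 0]
        if not left or not right:
            continue  # descriptor is constant in this node, so it cannot split
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score

# Toy data (hypothetical): 4 compounds, 3 binary descriptors, activity classes 0 and 3.
X = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 0]]
y = [0, 0, 3, 3]
best_j, best_score = best_single_split(X, y)
```

Ties are broken in favor of the lowest descriptor index, which is a design choice of this sketch only.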
Recursive Partitioning: Rules are complex
[Figure: large standard RP tree; internal nodes split on single descriptors (12, 3, 17, 18, ...), most terminal nodes labeled 0]
• Are all splits necessary for the activity mechanism?
• Does an early split impede identification of other mechanisms?
Recursive Partitioning: Focus of Study
Need definitions for:
• Search space
• Purity measure, splitting criterion
• Stopping rule
Binary Formal Inference-Based Recursive Modeling (BFIRM)
• Cho, Shen, Hermsmeier (2000, JCICS)
• Rank predictors according to F-test
• Combine important predictors to form splitting variable
• Result is better QSAR rules
Recursive Partitioning/Simulated Annealing (RP/SA)
• Blower et al. (2002, JCICS)
• Best single predictor not necessarily best in combination
Tree Harvesting
• Yuan, Chipman, Welch (2006 tech report)
• "Trim" bits off each terminal node
Recursive Partitioning: RP/SA
• Splitting variables are based on a combination of K predictors
• Features are always present: $X_{j_1} = 1, X_{j_2} = 1, \ldots, X_{j_K} = 1$
• Search space of size $\binom{p}{K}$, e.g., $\binom{500}{10} \approx 2 \times 10^{20}$
• Uses simulated annealing (stochastic optimization)
• K is held fixed for all splits, and is assumed known
OBSTree
• Splitting variables are based on a combination of K predictors
• Combine approaches of BFIRM and RP/SA
• Features can be present or absent (chromosome selection): $X_{j_1} = 1, X_{j_2} = 0, X_{j_3} = 0, X_{j_4} = 1, \ldots, X_{j_K} = 1$
• Search space of size $2^K \binom{p}{K}$, e.g., $2^{10} \binom{500}{10} \approx 2 \times 10^{23}$
• Uses simulated annealing + weighted sampling + trimming
• "K" can change for all splits, and is assumed unknown
• Uses a penalty entropy splitting criterion
• Usual stopping criteria applied, including cross validation
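A small sketch of what a chromosome split means in practice, together with the two search-space sizes for p = 500 and K = 10. The `matches` helper and the dictionary representation of a compound are hypothetical choices made for this illustration.

```python
import math

def matches(compound_bits, descriptors, chromosome):
    """True if the compound shows exactly the required present (1) / absent (0) pattern."""
    return all(compound_bits[j] == bit for j, bit in zip(descriptors, chromosome))

p, K = 500, 10
rpsa_size = math.comb(p, K)        # RP/SA: choose K descriptors, all required present
obstree_size = 2 ** K * rpsa_size  # OBSTree: each chosen descriptor present or absent

# Hypothetical compound, stored as {descriptor index: bit}.
compound = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
hit = matches(compound, descriptors=[1, 2, 3, 4, 5], chromosome=[1, 0, 1, 0, 1])
```

Allowing absent features multiplies the search space by 2^K, which is why exhaustive search is replaced by stochastic optimization.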
OBSTree: Flowchart
Pre-OBSTree Setup
• Remove unary descriptors
• Determine Singly Important group
• Specify parameters
[Figure: Descriptor Pool → RP → Singly Important Descriptors / General Descriptors]
OBSTree: Flowchart
• Pre-OBSTree Setup
– Remove unary descriptors
– Determine Singly Important group
– Specify parameters
• Initialize split at next depth
– depth = depth + 1
– Select a set of K descriptors (X0) using WSS
– Determine best chromosome x0 of initial X0
• SA to determine "optimal" (XA, xA) for split using WSS
• Trim
– Check $2^K - 1$ subsets of current (XA, xA)
– Report best trimmed version as (X*, x*)
• X* = x*? (Yes/No): form terminal node
• depth = d or node size < 2·min or Ymax = 0 or Ybar > M−1?
– Yes: form last terminal node. STOP
– No: continue with the next split
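The trimming step can be sketched as an exhaustive pass over the 2^K − 1 non-empty subsets of the current rule. This is only an illustration: `score` is a hypothetical callable (lower is better), the tie-breaking toward smaller subsets is an assumption, and the descriptor numbers are toy values.

```python
from itertools import combinations

def trim(descriptors, chromosome, score):
    """Evaluate all 2^K - 1 non-empty subsets of a split rule and return the best one.

    `score(X, x)` is a hypothetical criterion, lower is better. Subsets are visited
    in order of increasing size with a strict comparison, so ties go to the
    smaller subset (an assumption of this sketch).
    """
    K = len(descriptors)
    best = None
    for size in range(1, K + 1):
        for idx in combinations(range(K), size):
            X = tuple(descriptors[i] for i in idx)
            x = tuple(chromosome[i] for i in idx)
            s = score(X, x)
            if best is None or s < best[0]:
                best = (s, X, x)
    return best[1], best[2]

calls = []

def toy_score(X, x):
    """Toy criterion: counts how many of the descriptors 10 and 30 are missing."""
    calls.append(X)
    return len({10, 30} - set(X))

X_star, x_star = trim((10, 20, 30, 40), (1, 1, 1, 0), toy_score)
```

With K = 4 the loop evaluates 15 subsets; the cost grows as 2^K, so this is practical only for small K.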
OBSTree: Splitting Criterion
• Node has N compounds
• Class i has proportion $p_i$ in the node, with a total of $n_i$ in the node
• Entropy (node impurity): $-\sum_{i=0}^{M} p_i \log p_i$
• Problem: Entropy = 0 (perfect) when a class of junk compounds is identified
• Penalty Entropy (penalize unwanted category)
– Penalize junk compounds: $-\frac{n_0}{N} \log\frac{1}{N} - \sum_{i=1}^{M} p_i \log p_i$
– General form: $-\frac{\sum_{j \in U} n_j}{N} \log\frac{1}{N} - \sum_{i \in W} p_i \log p_i$, with U the unwanted and W the wanted classes
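A minimal sketch of the two impurity measures. The penalty form used here is an assumption: each compound in an unwanted class is treated as a class of its own, contributing −(1/N) log(1/N), so that a node of pure junk is no longer scored as perfect. Function and argument names are hypothetical.

```python
import math

def entropy(counts):
    """Standard entropy: -sum_i p_i log p_i over all classes 0..M."""
    N = sum(counts)
    return -sum((n / N) * math.log(n / N) for n in counts if n > 0)

def penalty_entropy(counts, unwanted=(0,)):
    """Entropy with unwanted classes penalized (assumed form, see lead-in).

    Unwanted classes contribute -(n_j / N) log(1 / N); wanted classes
    contribute the usual -p_i log p_i.
    """
    N = sum(counts)
    n_unwanted = sum(counts[j] for j in unwanted)
    penalty = -(n_unwanted / N) * math.log(1 / N)
    wanted = -sum((n / N) * math.log(n / N)
                  for i, n in enumerate(counts) if i not in unwanted and n > 0)
    return penalty + wanted

junk_node = [20, 0, 0, 0]    # 20 compounds, all in class 0
active_node = [0, 0, 0, 20]  # 20 compounds, all in class 3
```

Under this form, the junk node has entropy 0 but penalty entropy log 20, while the pure active node scores 0 on both.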
OBSTree: Stopping Criteria
• Maximum depth d
• The most active compound is junk
• The node size is less than 2j (j is the minimum node size)
• 5-fold cross-validation, e.g., choose depth d if
– # correct classifications levels off at depth d
– Accept H0: $\Delta_{d+1} = 0$, where $\Delta_{d+1}$ is the change in sensitivity between depths d and d+1
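The node-level stopping checks can be sketched as a single predicate. The mean-activity check (Ybar > M − 1) comes from the flowchart slide; the cross-validation criterion is not covered here. The function name and argument names are hypothetical.

```python
def should_stop(depth, y, max_depth, min_node_size, M):
    """Return True if the node should become the last terminal node.

    y is the list of activity classes (0..M) of the compounds in the node.
    """
    if depth >= max_depth:            # maximum depth d reached
        return True
    if len(y) < 2 * min_node_size:    # too small to split into two valid nodes
        return True
    if max(y) == 0:                   # the most active compound is junk
        return True
    if sum(y) / len(y) > M - 1:       # mean activity already exceeds M - 1
        return True
    return False
```

For example, with a minimum node size of 5 and M = 3, a node of 20 compounds split evenly between classes 0 and 3 is not stopped, while a node of 8 compounds is.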
Simulation Study
• 1000 compounds, 500 binary descriptors
• Four active groups (20 compounds per group), 8% active

Activity Mechanism   Potency   Descriptor Set        Chromosome
I                    3         1, 2, 3, 4, 5         1, 0, 1, 0, 1
II                   3         5, 6, 7, 8, 9         0, 1, 1, 1, 1
III                  2         3, 11, 12, 13, 17     1, 1, 1, 1, 1
IV                   1         15, 16, 17, 18, 19    1, 1, 0, 1, 1
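A dataset with this structure could be generated along the following lines. This is a sketch under stated assumptions: the background bit density (0.1), the seed, and the rule that inactive compounds are redrawn until they match no mechanism are not from the talk. Planted compounds are labeled by their own group's potency even if they happen to match another mechanism by chance.

```python
import random

# Mechanisms from the table: (potency, descriptor set, chromosome); descriptors are 1-based.
MECHANISMS = [
    (3, [1, 2, 3, 4, 5], [1, 0, 1, 0, 1]),
    (3, [5, 6, 7, 8, 9], [0, 1, 1, 1, 1]),
    (2, [3, 11, 12, 13, 17], [1, 1, 1, 1, 1]),
    (1, [15, 16, 17, 18, 19], [1, 1, 0, 1, 1]),
]

def matches(x, descriptors, chromosome):
    """True if bit vector x (0-based list) shows the mechanism's pattern."""
    return all(x[j - 1] == bit for j, bit in zip(descriptors, chromosome))

def simulate(n=1000, p=500, per_group=20, density=0.1, seed=0):
    """Return (X, y). `density` and `seed` are assumptions of this sketch."""
    rng = random.Random(seed)
    X, y = [], []
    for potency, descriptors, chromosome in MECHANISMS:
        for _ in range(per_group):
            x = [int(rng.random() < density) for _ in range(p)]
            for j, bit in zip(descriptors, chromosome):
                x[j - 1] = bit  # plant the mechanism's bit pattern
            X.append(x)
            y.append(potency)
    while len(X) < n:  # inactive compounds must match no mechanism
        x = [int(rng.random() < density) for _ in range(p)]
        if not any(matches(x, d, c) for _, d, c in MECHANISMS):
            X.append(x)
            y.append(0)
    return X, y

X, y = simulate()
```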
Simulation Study: Standard RP Tree
[Figure: standard RP tree with many single-descriptor splits (12, 3, 17, 18, ...), most terminal nodes labeled 0; annotated terminal nodes: "5 compounds of 3 + 5 compounds of 0" and "7 compounds of 3"]
Simulation Study: Sample OBSTree
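[Figure: OBSTree with four splits, written as descriptor set/chromosome: 1,2,3,4,5/1,0,1,0,1; 15,16,17,18,19/1,1,0,1,1; 5,6,7,8,9/0,1,1,1,1; 3,11,12,13,17/1,1,1,1,1. Terminal nodes are labeled 3, 1, 3, 2, and 0.]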
Simulation Study: 5-fold Cross-validation

OBSTree
               Actual 0   Actual 1   Actual 2   Actual 3   Accuracy
Prediction 0   918        0          0          5          99.5%
Prediction 1   0          20         0          0          100%
Prediction 2   0          0          20         0          100%
Prediction 3   2          0          0          35         94.6%
Hit            99.7%      100%       100%       87.5%      Overall Accuracy: 99.3%

RP
               Actual 0   Actual 1   Actual 2   Actual 3   Accuracy
Prediction 0   910        1          0          34         96.3%
Prediction 1   3          19         0          0          86.4%
Prediction 2   0          0          20         0          100%
Prediction 3   7          0          0          6          46.2%
Hit            98.9%      95%        100%       15%        Overall Accuracy: 95.5%
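The row accuracies, hit rates, and overall accuracy follow directly from the cell counts. The short sketch below recomputes them; the `summarize` helper is hypothetical, and rows are taken as predictions and columns as actual classes, as in the tables.

```python
def summarize(confusion):
    """confusion[i][j] = number of compounds predicted as class i with actual class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Accuracy per predicted class: correct predictions / all predictions of that class.
    row_accuracy = [confusion[i][i] / sum(confusion[i]) for i in range(k)]
    # Hit rate per actual class: correct predictions / all compounds of that class.
    hit_rate = [confusion[j][j] / sum(confusion[i][j] for i in range(k)) for j in range(k)]
    overall = sum(confusion[i][i] for i in range(k)) / total
    return row_accuracy, hit_rate, overall

obstree = [[918, 0, 0, 5], [0, 20, 0, 0], [0, 0, 20, 0], [2, 0, 0, 35]]
rp = [[910, 1, 0, 34], [3, 19, 0, 0], [0, 0, 20, 0], [7, 0, 0, 6]]

obstree_rows, obstree_hits, obstree_overall = summarize(obstree)
rp_rows, rp_hits, rp_overall = summarize(rp)
```

The diagonal sums are 993 of 1000 for OBSTree and 955 of 1000 for RP; the largest gap is in the hit rate for class 3 (35/40 versus 6/40).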
Simulation Study: Sensitivity Analysis
• K, descriptor set size
– K > 7 perfectly found all mechanisms
– K = 7 perfectly found all but one mechanism
• Basic tree parameters
– Min node size is 5
• SA parameters
– Initial temperature
– Minimum temperature
– Temperature reduction rate
– # transitions at a given temperature
– # failures to accept new point before increasing transition counter
– Sampling weights in WSS
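To show where the SA parameters enter, here is a generic simulated-annealing loop with geometric cooling. It is a textbook sketch, not the OBSTree implementation: it omits the failure counter and the weighted sampling (WSS), and the toy score, neighbor move, and default parameter values are all hypothetical.

```python
import math
import random

def anneal(score, initial, neighbor, t_init=1.0, t_min=0.01, rate=0.9,
           transitions=50, seed=0):
    """Minimize `score` by simulated annealing with geometric cooling."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    t = t_init                                   # initial temperature
    while t > t_min:                             # minimum temperature
        for _ in range(transitions):             # transitions at this temperature
            candidate = neighbor(current, rng)
            cand_score = score(candidate)
            delta = cand_score - current_score
            # Always accept improvements; accept worse moves with prob exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_score = candidate, cand_score
                if current_score < best_score:
                    best, best_score = current, current_score
        t *= rate                                # temperature reduction rate
    return best, best_score

TARGET = frozenset({2, 5, 7})  # toy "true" descriptor set, purely illustrative

def toy_score(subset):
    """Number of target descriptors missing from the subset (lower is better)."""
    return len(TARGET - subset)

def swap_one(subset, rng):
    """Replace one descriptor in the subset with one from outside it (pool of 30)."""
    out = rng.choice(sorted(subset))
    pool = [j for j in range(30) if j not in subset]
    return (subset - {out}) | {rng.choice(pool)}

best, best_score = anneal(toy_score, frozenset({0, 1, 3}), swap_one)
```

Because acceptance is random, different seeds can return different subsets, which matches the point that the output is not deterministic.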
Screening to Identify MAO Inhibitors
• Neuronal MAO deactivates neurotransmitters
• Pargyline, an MAO inhibitor, was used to treat depression
• MAO inhibitors no longer used due to toxicity & interactions
• Abbott Laboratories dataset of MAO inhibitors
– Brown & Martin (1996 JCICS), 1646 chemically diverse compounds
– 1380 binary 2D atom-pair descriptors
– Response variable: 0, 1, 2, 3 (ordered data) [1358/114/86/88]
– Category 3 has 2 well-known mechanisms, Rusinko et al. (1999 JCICS)
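[Figure: trees fitted to the MAO data by OBSTree, RP/SA, and RP, with node counts given per class as 0/1/2/3. Labels recovered from the figure: OBSTree splits 81,177,579,183/1,1,1,0 and 1,579,1184,809/1,1,1,0; descriptor sets 32,572,844 and 184,721,879; single descriptors 704, 1184, 65, 81, 183; node counts 0/1/0/6, 1/0/1/26, 0/0/0/33, 0/0/0/15, 2/0/0/32, 2/0/5/24, 9/2/1/2, 959/85/55/18, 99/1/0/0]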
MAO: Activity Mechanism I
• "Irreversible binding to flavin cofactor of MAO"
• Pargyline-like compounds
• Typical features of pargyline-like compounds
– A triple bond
– A tertiary nitrogen
– An aromatic ring
• 1st terminal node of OBSTree: descriptors 81, 183, 177, 579 with chromosome 1, 0, 1, 1; node counts 0/0/0/33
• Highest active terminal node of RP: descriptors 81, 704 with values 1, 1; node counts 2/0/0/32
• 1st terminal node of RP/SA: descriptors 184, 721, 879 with values 1, 1, 1; node counts 0/1/0/6
MAO: Activity Mechanism I
• Compound 1: Pargyline, y=3, has 579 & 81 & 177 but not 183
• Compound 2: y=0, has feature 183 so violates OBSTree
• Compound 3: y=0, falls in active node from RP
• Compound 4: y=0, falls in active node from RP and RP/SA
MAO: Activity Mechanism II
• "Binding to active site"
• –N-N-C(=O)- is a hydrazine feature that can be hydrolyzed to bind protein (MAO) as a nonselective, irreversible inhibitor
[Figure: chemical structure annotated with descriptors 1: C(1,0)-3-C(1,0); 579: C(2,1)-3-C(3,1); 1184: N(2,0)-2-N(2,0)]
Absent Descriptor (809: C(3,1)-4-Br)
[Figure: two bromine-containing structures, labeled Activity=3 and Activity=2, with the atom pair C(3,1)-4-Br marked]
Summary
• OBSTree: new RP algorithm for obtaining simplified output
– Model presence and absence of molecular features
– Combination size is data-driven, varies over splits
– Penalty entropy splitting criterion for one-sided purity
– Weighted sampling during optimization allows prior information
• Simpler verification of QSAR
• Standard RP and RP/SA are special cases of OBSTree
• Output is not deterministic
• As with any RP output, care should be taken when interpreting the results
– Can miss highly correlated but important predictors
– Different trees provide similar partitions of the data
– Because of hard thresholding, predictions are highly variable
• Computationally intensive!
Acknowledgements
• Atina Brooks, North Carolina State University
• Jiajun Liu, Merck
• Haojun Ouyang, North Carolina State University
• Abbott Laboratories
• Jack Liu, OmicSoft
• Jun Feng, NIH
• GoldenHelix