Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays
J. Tobler, M. Molla, J. Shavlik (University of Wisconsin-Madison)
M. Molla, E. Nuwaysir, R. Green (NimbleGen Systems Inc.)
Oligonucleotide Microarrays
[Figure: probes attached to the chip surface]
Specific probes synthesized at known spots on the chip's surface
Probes complementary to RNA of genes to be measured
Typical gene (1kb+) MUCH longer than typical probe (24 bases)
Probes: Good vs. Bad
[Figure: a good probe and a bad probe; blue = probe, red = sample]
Probe-Picking Method Needed
Hybridization characteristics differ between probes
Probe set represents very small subset of gene
Accurate measurement of expression requires good probe set
Related Work
Use known hybridization characteristics: Lockhart et al. 1996
Melting point (Tm) predictions: Kurata and Suyama 1999; Li and Stormo 2001
Stable secondary structure: Kurata and Suyama 1999
Our Approach
Apply established machine-learning algorithms: train on categorized examples, test on examples with category hidden
Choose features to represent probes
Categorize probes as good or bad
The Features
Feature Name: Description
fracA, fracC, fracG, fracT: The fraction of A, C, G, or T in the 24-mer
fracAA, fracAC, fracAG, fracAT, fracCA, fracCC, fracCG, fracCT, fracGA, fracGC, fracGG, fracGT, fracTA, fracTC, fracTG, fracTT: The fraction of each of these dimers in the 24-mer
n1, n2, …, n24: The particular nucleotide (A, C, G, or T) at the specified position in the 24-mer
d1, d2, …, d23: The particular dimer (AA, AC, …, TT) at the specified position in the 24-mer
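The feature definitions above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `probe_features` is hypothetical, and dividing dimer counts by the number of overlapping dimers (23 for a 24-mer) is an assumption, since the slide does not state the denominator.

```python
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(pair) for pair in product(BASES, repeat=2)]


def probe_features(probe):
    """Return the composition and position features for one probe."""
    n = len(probe)
    features = {}
    # fracA, fracC, fracG, fracT: fraction of each base in the probe.
    for base in BASES:
        features["frac" + base] = probe.count(base) / n
    # fracAA ... fracTT: fraction of each dimer among the n - 1
    # overlapping dimers (the denominator is an assumption).
    dimers = [probe[i:i + 2] for i in range(n - 1)]
    for dimer in DIMERS:
        features["frac" + dimer] = dimers.count(dimer) / len(dimers)
    # n1 ... n24: the base at each 1-based position.
    for i, base in enumerate(probe, start=1):
        features[f"n{i}"] = base
    # d1 ... d23: the dimer starting at each 1-based position.
    for i, dimer in enumerate(dimers, start=1):
        features[f"d{i}"] = dimer
    return features
```

For a 24-mer this yields 4 + 16 + 24 + 23 = 67 features per probe.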
The Data
Gene Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…
Complement: CATCGATCGTAATCGTACCGGTCAGTAC…
Probe 1: CATCGATCGTAATCGTACCGGTCA
Probe 2: ATCGATCGTAATCGTACCGGTCAG
Probe 3: TCGATCGTAATCGTACCGGTCAGT
… …
Tilings of 8 genes (from E. coli & B. subtilis)
Every possible probe (~10,000 probes)
Genes known to be expressed in sample
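As a rough sketch of the tiling shown on this slide, the snippet below complements the gene base by base (as in the slide's example, without reversing it) and slides a 24-base window along the result. The function names are hypothetical, and only the visible prefix of the gene sequence is used.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(sequence):
    """Base-by-base complement, matching the slide's example."""
    return "".join(COMPLEMENT[base] for base in sequence)


def tile_probes(gene, probe_length=24):
    """Every possible probe: one window per starting position."""
    comp = complement(gene)
    return [comp[i:i + probe_length]
            for i in range(len(comp) - probe_length + 1)]


# Visible prefix of the gene sequence from the slide (the rest is elided).
gene_prefix = "GTAGCTAGCATTAGCATGGCCAGTCATG"
probes = tile_probes(gene_prefix)
for number, probe in enumerate(probes[:3], start=1):
    print(f"Probe {number}: {probe}")
# Probe 1: CATCGATCGTAATCGTACCGGTCA
# Probe 2: ATCGATCGTAATCGTACCGGTCAG
# Probe 3: TCGATCGTAATCGTACCGGTCAGT
```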
Our Microarray
[Figure: image of the microarray, with a scale from 0 to 99]
Defining our Categories
[Figure: histogram of normalized probe intensity (x-axis, ticks at 0, .05, .15, 1.0) versus frequency (y-axis)]
Low Intensity = BAD Probes (45%)
Mid-Intensity = Not Used in Training Set (23%)
High Intensity = GOOD Probes (32%)
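A minimal sketch of the three-way split follows. It assumes the axis ticks at .05 and .15 are the category cutoffs, which the slide suggests but does not state; the function name and the handling of values exactly on a cutoff are also assumptions.

```python
def categorize(normalized_intensity, low_cutoff=0.05, high_cutoff=0.15):
    """Label a probe "bad", "good", or None (mid-intensity, not used).

    The cutoffs are read off the histogram's axis ticks and are an
    assumption, as is treating boundary values as mid-intensity.
    """
    if normalized_intensity < low_cutoff:
        return "bad"
    if normalized_intensity > high_cutoff:
        return "good"
    return None  # mid-intensity probes are left out of the training set


intensities = [0.01, 0.10, 0.40]
training_labels = [label for label in map(categorize, intensities)
                   if label is not None]
print(training_labels)  # ['bad', 'good']
```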
The Machine Learning Techniques
Naïve Bayes (Mitchell 1997)
Neural Networks (Rumelhart et al. 1995)
Decision Trees (Quinlan 1996)
Can interpret predictions of each learner probabilistically
Naïve Bayes
Assumes conditional independence between features
Make judgments about test set examples based on conditional probability estimates made on training set
Naïve Bayes
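For each example in the test set, evaluate the following:
$$\frac{P(\mathit{high}) \prod_i P(\mathit{feature}_i = \mathit{value}_i \mid \mathit{high})}{P(\mathit{low}) \prod_i P(\mathit{feature}_i = \mathit{value}_i \mid \mathit{low})}$$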
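One way to sketch this comparison of P(high) times the per-feature conditionals against P(low) times the per-feature conditionals is to work in log space, where a positive score favors "high". The code below is an illustration under stated assumptions, not the authors' implementation: it expects discrete feature values (numeric fractions would need binning first), and the add-one smoothing is an assumption added to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, labels):
    """examples: list of {feature: discrete value}; labels: "high"/"low"."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (label, feature) -> value counts
    feature_values = defaultdict(set)     # feature -> values seen in training
    for example, label in zip(examples, labels):
        for feature, value in example.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values


def log_odds_high(model, example):
    """log of [P(high) * prod P(f=v|high)] / [P(low) * prod P(f=v|low)]."""
    class_counts, value_counts, feature_values = model
    score = math.log(class_counts["high"]) - math.log(class_counts["low"])
    for feature, value in example.items():
        n_values = len(feature_values[feature] | {value})
        for label, sign in (("high", 1), ("low", -1)):
            # Add-one (Laplace) smoothing: an assumption, not from the slides.
            count = value_counts[(label, feature)][value]
            prob = (count + 1) / (class_counts[label] + n_values)
            score += sign * math.log(prob)
    return score
```

A probe would be predicted "high" when `log_odds_high` is positive, and the score itself can be read as a confidence.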
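Neural Network (1-of-n encoding with probe length = 3)
[Figure: network diagram. The example probe sequence "CAG" is presented to input units A1, C1, G1, T1, A2, C2, G2, T2, A3, C3, G3, T3, which connect through weights to a Good-or-Bad output; arrows are labeled ACTIVATION and ERROR.]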
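The 1-of-n input encoding can be illustrated as follows: each position gets four input units (A, C, G, T), and exactly one of them is set to 1. The function name is hypothetical, and the A, C, G, T ordering within a position follows the unit labels in the figure.

```python
BASES = "ACGT"


def one_of_n_encode(probe):
    """Four input units per position; the unit for the observed base is 1."""
    encoding = []
    for base in probe:
        encoding.extend(1 if base == candidate else 0 for candidate in BASES)
    return encoding


print(one_of_n_encode("CAG"))
# [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
```

A 24-mer encoded this way produces 96 input values.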
Decision Tree
Automatically builds a tree of rules
[Figure: example decision tree. The root tests n14 with branches A, C, G, T; internal nodes test fracAC, fracT, fracTC, fracC, and fracG with Low/High branches; leaves are Good Probe or Bad Probe.]
Decision Tree
The information gain of a feature, F, is:
$$\mathit{InformationGain}(S, F) = \mathit{Entropy}(S) - \sum_{v \in \mathit{Values}(F)} \frac{|S_v|}{|S|}\,\mathit{Entropy}(S_v)$$
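Information gain is the entropy of the example set S minus the size-weighted entropy of the subsets S_v obtained by splitting on each value v of the feature. A small sketch, with hypothetical function names and toy data, assuming discrete feature values:

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy(S) minus the weighted entropy of each subset S_v."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in subsets.values())
    return entropy(labels) - remainder


labels = ["good", "good", "bad", "bad"]
print(information_gain(["C", "C", "A", "A"], labels))  # 1.0
print(information_gain(["C", "A", "C", "A"], labels))  # 0.0
```

The first toy feature separates the two categories perfectly, so it gains the full 1 bit; the second carries no information about the category.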
Information Gain per Feature
[Figure: normalized information gain (y-axis, 0.0 to 1.0) per feature, in two groups. Probe Composition Features: the single bases A, C, G, T and the 16 dimers AA through TT. Base Position Features: plotted by base position (1 to 24) and by dimer position (1 to 23).]
Cross-Validation
Leave-one-out testing: For each gene (of the 8)
Train on all but this gene
Test on this gene
Record result
Forget what was learned
Average results across 8 test genes
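The leave-one-gene-out loop above can be sketched generically. Everything here is illustrative: the function names, the toy majority-class learner, and the gene keys `g1` to `g3` are hypothetical stand-ins, not the learners or data from the talk.

```python
from collections import Counter


def leave_one_gene_out(examples_by_gene, train, evaluate):
    """Hold out each gene in turn; return per-gene results and their mean."""
    results = {}
    for held_out in examples_by_gene:
        training = [example
                    for gene, examples in examples_by_gene.items()
                    if gene != held_out
                    for example in examples]
        model = train(training)  # a fresh model per fold: nothing carries over
        results[held_out] = evaluate(model, examples_by_gene[held_out])
    average = sum(results.values()) / len(results)
    return results, average


def train_majority(examples):
    """Toy learner: always predict the most common training label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def accuracy(model, examples):
    """Fraction of held-out examples whose label matches the prediction."""
    return sum(label == model for _, label in examples) / len(examples)


toy = {
    "g1": [("probe", "good"), ("probe", "good")],
    "g2": [("probe", "good"), ("probe", "bad")],
    "g3": [("probe", "good"), ("probe", "good")],
}
per_gene, mean_accuracy = leave_one_gene_out(toy, train_majority, accuracy)
print(per_gene)  # {'g1': 1.0, 'g2': 0.5, 'g3': 1.0}
```

Splitting by gene rather than by probe keeps overlapping probes from the same gene out of both the training and test sets at once.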
Typical Probe-Intensity Prediction Across Short Region
[Figure: actual normalized probe intensity (y-axis, 0 to 1) versus starting nucleotide position for the 24-mer probe (x-axis, 650 to 700).]
Typical Probe-Intensity Prediction Across Short Region
[Figure: normalized probe intensity (y-axis, 0 to 1) versus starting nucleotide position for the 24-mer probe (x-axis, 650 to 700). Series: Actual, Naïve Bayes, Decision Tree, Neural Network.]
Probe-Picking Results
[Figure: number of probes selected with intensity >= 90th percentile (y-axis, 0 to 20) versus number of probes selected (x-axis, 0 to 20). Series: Perfect Selector.]
Probe-Picking Results
[Figure: number of probes selected with intensity >= 90th percentile (y-axis, 0 to 20) versus number of probes selected (x-axis, 0 to 20). Series: Naïve Bayes, Neural Network, Decision Tree, Primer Melting Point, Perfect Selector.]
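The probe-picking plot counts, for each number of probes selected, how many of them have measured intensity at or above the 90th percentile. A hedged sketch of that curve is below; the function name, the nearest-rank percentile definition, and the toy intensities are assumptions, and the slides do not say whether the percentile is computed per gene or over all probes.

```python
def picking_curve(predicted_scores, actual_intensities,
                  max_picks=20, percentile=90):
    """For k = 1..max_picks, count top-k picks at or above the percentile."""
    n = len(actual_intensities)
    # Nearest-rank percentile (an assumption about how the cutoff is defined).
    rank = max(1, (percentile * n + 99) // 100)
    threshold = sorted(actual_intensities)[rank - 1]
    # Select probes in order of decreasing predicted score.
    order = sorted(range(n), key=lambda i: predicted_scores[i], reverse=True)
    curve, hits = [], 0
    for index in order[:max_picks]:
        if actual_intensities[index] >= threshold:
            hits += 1
        curve.append(hits)
    return curve


actual = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# A perfect selector ranks probes by their true intensity.
print(picking_curve(actual, actual, max_picks=3))                 # [1, 2, 2]
# A selector that ranks probes in exactly the wrong order.
print(picking_curve([-a for a in actual], actual, max_picks=3))   # [0, 0, 0]
```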
Current and Future Directions
Consider more features: folding patterns, melting point
Feature selection
Evaluate specificity along with sensitivity (i.e., consider false positives)
Evaluate probe selection + gene calling
Try more ML techniques: SVMs, ensembles, …
Take-Home Message
Machine learning does a good job on this part of the probe-selection problem
Easy to collect a large number of training examples
Easily measured features work well
Intelligent probe selection can increase microarray accuracy and efficiency
Acknowledgements
NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array. Darryl Roy for helping in creating the training data. Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349.
Thanks