Sample-Based Estimators for the Intrinsically Multivariate Prediction Score
Ting Chen
Department of Electrical Engineering
Texas A & M University
College Station, Texas 77843
Email: chenting@tamu.edu

Ulisses Braga-Neto (Corresponding Author)
Department of Electrical Engineering
Texas A & M University
College Station, Texas 77843
Email: ub@ieee.org
Abstract: Canalizing genes possess broad regulatory power over gene regulatory networks. In a previous publication, the concept of intrinsically multivariate predictive (IMP) genes was introduced and analyzed in the context of stochastic logic models. Furthermore, based on an empirical study of the DUSP1 gene, a canalizing gene in melanoma, it was hypothesized that canalizing genes possess IMP properties. In this paper, we study the problem of sample-based estimation of a gene IMP score. We study nonparametric IMP score estimators based on resubstitution, leave-one-out, cross-validation, and bootstrap, and introduce a maximum-likelihood (ML) IMP score estimator for a many-input stochastic logic model. Assuming two-input, three-input and four-input stochastic AND models, performance metrics of these estimators are calculated by Monte Carlo sampling. Our results show that the ML IMP score estimator outperforms the other estimators in RMS under the assumed stochastic logic model, followed by the resubstitution IMP score estimator. This indicates that, provided one has information about regulatory relationships in the network, the ML IMP score estimator is the estimator of choice, whereas resubstitution is to be preferred in the absence of prior knowledge.
Index Terms: Intrinsically Multivariate Prediction, Maximum Likelihood Estimation, Stochastic Logic, Gene Regulatory Networks, Prediction Error Estimation.
I. INTRODUCTION
The existence of canalizing genes that can constrain a
biological system to particular functions was proposed by
C. Waddington in [1]. Such canalizing genes are frequently
found in signaling pathways, which deliver information from
a variety of sources to the machinery that enacts central
cellular functions such as cell-cycle, survival, apoptosis and
metabolism. For example, gene DUSP1 is canalizing in its
phosphorylated state, which is a central component of a
process-integrating pathway implicated in melanoma metastasis.
Martins and collaborators [3] hypothesized that when
the controlling gene is active, it cannot be well-predicted by
subsets of its predictor genes, but it can be predicted by the
full set with great accuracy. Such a set of predictor genes is
called Intrinsically Multivariate Predictive (IMP) for the target
gene in [3], where it was shown that DUSP1 presents a large
number of IMP gene sets in its pathway.
The concept of IMP gene is defined in terms of the binary
Coefficient of Determination (CoD) [2]. As such, IMP depends
on the probability model connecting predictors and target,
which, however, is usually unknown, or only partially known.
Therefore, the problem arises of how to find and characterize
the performance of sample-based IMP score estimators. In
[6], we have studied the performance of four nonparamet-
ric sample-based CoD estimators, based on resubstitution,
leave-one-out, cross-validation and bootstrap prediction error
estimators. We introduce in this paper the corresponding
nonparametric IMP score estimators. In addition, we extend
the two-input stochastic logic models studied in [3, 5] to
the many-input case and propose the Maximum-Likelihood
IMP score estimator for this class of models, based on the
corresponding maximum-likelihood CoD estimator introduced
in [5]. Here, we consider the two-input, three-input and four-
input stochastic AND model, and calculate approximate per-
formance metrics, namely, bias, variance and RMS, of the ML
IMP score estimators as a function of predictive power using a
Monte-Carlo sampling approach. The results indicate that the
ML IMP score estimator is the estimator of choice, under the
assumed stochastic logic model, whereas resubstitution is to
be preferred in the absence of prior knowledge.
The paper is organized as follows. Section II introduces
the many-input stochastic logic model and the maximum-likelihood
estimators of its parameters. In Section
III, we define several IMP score estimators in analogy to
the corresponding problem of CoD estimation, and analyze
the bias, variance and RMS of these estimators approximated
using a Monte-Carlo sampling approach. Section IV compares
the performance metrics of these IMP score estimators. Fi-
nally, Section V presents concluding remarks.
II. STOCHASTIC LOGIC MODEL
In Genomic Signal Processing, Boolean (logic) circuits play
a prominent role in modeling gene regulatory networks [3,
4]. However, noise in the sample data affects the Boolean
functions and causes “inconsistency” between the sample data
and a deterministic logic circuit. To address this problem, we
introduce next a many-input stochastic logic model, which
extends the two-input logic model introduced in [3]. The logic
gates in this class of models are replaced by a joint probability
distribution between predictors and target.
2011 IEEE International Workshop on Genomic Signal Processing and Statistics, December 4-6, 2011, San Antonio, Texas, USA. 978-1-4673-0490-0/11/$26.00 ©2011 IEEE

Let $X = (X_1, \ldots, X_d)$ be a set of binary predictor variables and let $Y$ be the target variable. To formulate $P(X = x)$, we introduce a measure of covariance among predictors. For any $\{i_1, i_2, \ldots, i_r\} \subseteq \{1, \ldots, d\}$, we define $\gamma(i_1, i_2, \ldots, i_r) = E[X_{i_1} X_{i_2} \cdots X_{i_r}] - E[X_{i_1}] E[X_{i_2}] \cdots E[X_{i_r}]$. Note that, for $d = 2$, $\gamma(1, 2) = E[X_1 X_2] - E[X_1] E[X_2]$. Based on this definition, the joint probability of $(X, Y)$ for the many-input stochastic logic model is given next, without proof due to space limitations.
Many-Input Logic Model: Let $f : \{0, 1\}^d \to \{0, 1\}$ be a given Boolean function (logic gate), and let $S_d = \{1, 2, \ldots, d\}$. Then

$$P(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d) = p^{f(x_1, \ldots, x_d)} (1 - p)^{1 - f(x_1, \ldots, x_d)} \tag{1}$$

while

$$P(X_1 = x_1, \ldots, X_d = x_d) = \prod_{i=1}^{d} P_i^{x_i} (1 - P_i)^{1 - x_i} + (-1)^{\sum_{i=1}^{d} x_i} \sum_{\{i_1, \ldots, i_r\} \subseteq S_d} \left\{ (-1)^r \gamma(i_1, \ldots, i_r) \prod_{k \in S_d \setminus \{i_1, \ldots, i_r\}} (1 - x_k) \right\} \tag{2}$$

where $p = P(f(X_1, \ldots, X_d) = Y)$ is the predictive power, $P_i = E[X_i] = P(X_i = 1)$, $i = 1, 2, \ldots, d$, are the predictor “biases” (the value 0.5 being considered unbiased), and $r \geq 2$. Eqs. (1) and (2) fully determine the joint distribution $P(X_1 = x_1, \ldots, X_d = x_d, Y = y) = P(Y = y \mid X_1 = x_1, \ldots, X_d = x_d) \, P(X_1 = x_1, \ldots, X_d = x_d)$.
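As an illustrative sketch (not part of the original paper), the predictor distribution in Eq. (2) can be evaluated directly. The function name `predictor_pmf`, the 0-based indexing of predictors, and the dictionary layout for the γ terms are assumptions made here for illustration; the parameter values in the usage lines are the three-input values quoted in the caption of Fig. 1.

```python
from itertools import combinations, product
from math import prod


def predictor_pmf(x, P, gamma):
    """Evaluate Eq. (2) at the binary tuple x.

    P     : list of predictor biases P_i = P(X_i = 1)
    gamma : dict mapping a sorted tuple of 0-based predictor indices
            (length >= 2) to the corresponding gamma value; missing
            subsets are treated as gamma = 0 (an assumption of this sketch)
    """
    d = len(P)
    # Independent part: product of Bernoulli marginals.
    indep = prod(P[i] if x[i] else 1 - P[i] for i in range(d))
    # Correction part: sum over all index subsets of size r >= 2.
    corr = 0.0
    for r in range(2, d + 1):
        for subset in combinations(range(d), r):
            rest = prod(1 - x[k] for k in range(d) if k not in subset)
            corr += (-1) ** r * gamma.get(subset, 0.0) * rest
    return indep + (-1) ** sum(x) * corr


# Usage with the three-input parameter values from the caption of Fig. 1.
P = [0.5, 0.6, 0.65]
gamma = {(0, 1): 0.01, (0, 2): 0.025, (1, 2): 0.035, (0, 1, 2): 0.02}
pmf = {x: predictor_pmf(x, P, gamma) for x in product((0, 1), repeat=3)}
```

Summing `pmf` over all eight cells should give 1, and summing the cells with $x_1 = x_2 = 1$ should recover $P_1 P_2 + \gamma(1, 2)$, which is one way to sanity-check a transcription of Eq. (2).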
The two-input logic model (i.e., d = 2) is a special case of
the many-input logic model, which is given next.
Two-Input Logic Model: For a given Boolean function (logic gate) $f : \{0, 1\}^2 \to \{0, 1\}$, let

$$P(Y = 1 \mid X_1 = x_1, X_2 = x_2) = p^{f(x_1, x_2)} (1 - p)^{1 - f(x_1, x_2)} \tag{3}$$

$$P(X_1 = x_1, X_2 = x_2) = P_1^{x_1} P_2^{x_2} (1 - P_1)^{1 - x_1} (1 - P_2)^{1 - x_2} + (-1)^{x_1 + x_2} \gamma. \tag{4}$$
Note that the predictive power $p$ concerns only the stochastic logic gate, whereas $P_1, \ldots, P_d$ and the $\gamma$'s up to order $d$ concern only the marginal distribution of the $d$ predictors. We will assume throughout that $p \geq 1/2$, since if $p < 1/2$ one obtains the negated logic gate with predictive power $1 - p$.
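The separation between the gate parameter and the predictor parameters can be checked numerically. The following is a minimal sketch, assuming an AND gate and hypothetical helper names (`and_gate`, `joint_pmf`); it builds the joint distribution from Eqs. (3) and (4) and leaves it to the reader to confirm that the probability of the event $f(X) = Y$ equals $p$ and that the marginal of $X_1$ equals $P_1$.

```python
from itertools import product


def and_gate(x1, x2):
    """Two-input AND gate on binary inputs."""
    return x1 & x2


def joint_pmf(x1, x2, y, p, P1, P2, gamma, f=and_gate):
    """Joint probability P(X1 = x1, X2 = x2, Y = y) from Eqs. (3) and (4)."""
    # Eq. (4): predictor distribution.
    px = (P1 ** x1 * P2 ** x2 * (1 - P1) ** (1 - x1) * (1 - P2) ** (1 - x2)
          + (-1) ** (x1 + x2) * gamma)
    # Eq. (3): probability that Y = 1 given the predictors.
    py1 = p ** f(x1, x2) * (1 - p) ** (1 - f(x1, x2))
    return px * (py1 if y == 1 else 1 - py1)


# Parameter values: p chosen for illustration; P1, P2, gamma as in Fig. 1 (two-input case).
params = dict(p=0.8, P1=0.4, P2=0.5, gamma=0.005)
cells = list(product((0, 1), repeat=3))
total = sum(joint_pmf(x1, x2, y, **params) for x1, x2, y in cells)
agree = sum(joint_pmf(x1, x2, y, **params) for x1, x2, y in cells
            if and_gate(x1, x2) == y)
marg1 = sum(joint_pmf(x1, x2, y, **params) for x1, x2, y in cells if x1 == 1)
```

Here `agree` depends only on `p`, while `marg1` depends only on `P1`, in line with the remark above.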
A. Maximum-Likelihood Estimation of Model Parameters
In the absence of complete distributional knowledge, one
must estimate the model parameters from i.i.d. sample data
Sn = {(X11, . . . , X1d, Y1), . . . , (Xn1, . . . , Xnd, Yn)}, which
is assumed to be drawn from the probability model. The
ML estimators of the model parameters are obtained by
substituting sample averages for expectations. This is the basic
fact used to obtain the following proposition, which is given
without proof.
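Proposition 1. The maximum-likelihood estimators of the parameters of the many-input logic model are given by

$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} 1_{f(X_{i1}, \ldots, X_{id}) = Y_i},$$

$$\hat{P}_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}, \quad \text{for } j = 1, 2, \ldots, d,$$

$$\hat{\gamma}(i_1, \ldots, i_r) = \frac{1}{n} \sum_{j=1}^{n} \left[ \prod_{k=1}^{r} X_{j i_k} \right] - \frac{1}{n^r} \prod_{k=1}^{r} \left[ \sum_{j=1}^{n} X_{j i_k} \right], \quad \text{for } \{i_1, \ldots, i_r\} \subseteq S_d \text{ and } r \geq 2. \tag{5}$$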
It is easy to show that $\hat{p}$ and $\hat{P}_i$, for $i = 1, 2, \ldots, d$, are minimum-variance unbiased, with $\mathrm{Var}[\hat{p}] = \frac{1}{n} p (1 - p)$ and $\mathrm{Var}[\hat{P}_i] = \frac{1}{n} P_i (1 - P_i)$, for $i = 1, 2, \ldots, d$. However, $\hat{\gamma}$ is a biased estimator. As ML estimators, all these estimators are asymptotically unbiased, asymptotically efficient, and consistent [7].
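As a special case, the maximum-likelihood estimators of the parameters of the two-input logic model are given by

$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} 1_{f(X_{i1}, X_{i2}) = Y_i}, \quad \hat{P}_1 = \frac{1}{n} \sum_{i=1}^{n} X_{i1}, \quad \hat{P}_2 = \frac{1}{n} \sum_{i=1}^{n} X_{i2}, \quad \hat{\gamma} = \frac{1}{n} \sum_{i=1}^{n} X_{i1} X_{i2} - \frac{1}{n^2} \sum_{i=1}^{n} X_{i1} \sum_{i=1}^{n} X_{i2}.$$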
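The estimators of Proposition 1 amount to replacing expectations with sample averages. A minimal sketch follows, assuming the sample is stored as a list of binary tuples `X` with a parallel list of labels `Y`; the function name `ml_estimates` and the 0-based index subsets are illustrative choices, not notation from the paper.

```python
from math import prod


def ml_estimates(X, Y, f, subsets):
    """Sample-average (ML) estimates of p, P_j and gamma, as in Proposition 1.

    X       : list of binary tuples (one row per sample point)
    Y       : list of binary targets
    f       : Boolean function taking a tuple and returning 0 or 1
    subsets : iterable of tuples of 0-based predictor indices (length >= 2)
    """
    n, d = len(X), len(X[0])
    p_hat = sum(f(x) == y for x, y in zip(X, Y)) / n
    P_hat = [sum(x[j] for x in X) / n for j in range(d)]
    gamma_hat = {}
    for s in subsets:
        # First term of Eq. (5): sample mean of the product of the selected predictors.
        joint = sum(prod(x[k] for k in s) for x in X) / n
        # Second term: product of the sample means, i.e. (1/n^r) * prod_k sum_j X_{j i_k}.
        gamma_hat[s] = joint - prod(P_hat[k] for k in s)
    return p_hat, P_hat, gamma_hat


# Usage on a tiny made-up sample with an AND gate.
X = [(1, 1), (1, 0), (0, 1), (1, 1)]
Y = [1, 0, 0, 0]
p_hat, P_hat, gamma_hat = ml_estimates(X, Y, lambda x: int(all(x)), [(0, 1)])
```

For this toy sample three of the four points agree with the AND gate, so `p_hat` is 0.75, and both predictors have sample mean 0.75.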
III. INTRINSICALLY MULTIVARIATE PREDICTION
The Coefficient of Determination (CoD) of $X$ with respect to $Y$ [2] is defined to be

$$\mathrm{CoD}_Y(X) = \frac{\varepsilon_Y - \varepsilon_{X,Y}}{\varepsilon_Y} \tag{6}$$

where $\varepsilon_Y$ is the optimal error of predicting $Y$ in the absence of other observations and $\varepsilon_{X,Y}$ is the optimal error based on the observations of $X$. The CoD measures the nonlinear multivariate relationship between predictors and target. By convention, one assumes $0/0 = 1$ in the above definition.
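Given the many-input model (1), and writing $F(x) = \min(x, 1 - x)$ for $0 \leq x \leq 1$, the CoD is expressed by

$$\mathrm{CoD} = 1 - \frac{1 - p}{F\left( \sum_{x} P(Y = 1 \mid X = x) P(X = x) \right)} = 1 - \frac{1 - p}{F\left( \sum_{x} p^{f(x)} (1 - p)^{1 - f(x)} P(X = x) \right)}. \tag{7}$$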
Martins et al. [3] introduced the concept of an intrinsically multivariate predictive (IMP) gene set: $X$ is said to be IMP for $Y$ with respect to $\lambda$ and $\delta$, for $0 \leq \lambda < \delta \leq 1$, if

$$\max_{Z \subsetneq X} \mathrm{CoD}_Y(Z) \leq \lambda \quad \text{and} \quad \mathrm{CoD}_Y(X) \geq \delta. \tag{8}$$

Subsequently, [3] defined the IMP score of a pair $(X, Y)$ as

$$\mathrm{IMP}_Y(X) = \mathrm{CoD}_Y(X) - \max_{Z \subsetneq X} \mathrm{CoD}_Y(Z), \tag{9}$$

where $Z \neq \emptyset$. This definition is independent of $\lambda$ and $\delta$; instead one sets a threshold, and if the IMP score exceeds this threshold, then $X$ is said to be IMP for the target $Y$. In the two-predictor case, the IMP score is given by

$$\mathrm{IMP}_Y(X_1, X_2) = \mathrm{CoD}_Y(X_1, X_2) - \max\{\mathrm{CoD}_Y(X_1), \mathrm{CoD}_Y(X_2)\}. \tag{10}$$
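Definitions (6) and (9) can be evaluated by brute force whenever the joint distribution of $(X, Y)$ is known. The sketch below is an illustration under assumptions: the joint distribution is stored as a dictionary keyed by `((x_1, ..., x_d), y)`, and the helper `and_model_pmf` builds the two-input stochastic AND model of Eqs. (3) and (4). All function names are hypothetical.

```python
from itertools import combinations, product


def and_model_pmf(p, P1, P2, gamma):
    """Joint pmf of the two-input stochastic AND model, Eqs. (3) and (4)."""
    pmf = {}
    for x1, x2 in product((0, 1), repeat=2):
        px = ((P1 if x1 else 1 - P1) * (P2 if x2 else 1 - P2)
              + (-1) ** (x1 + x2) * gamma)
        py1 = p if (x1 & x2) else 1 - p
        pmf[((x1, x2), 1)] = px * py1
        pmf[((x1, x2), 0)] = px * (1 - py1)
    return pmf


def optimal_error(pmf, subset):
    """Optimal (Bayes) error of predicting Y from the predictors in `subset`."""
    mass = {}
    for (x, y), pr in pmf.items():
        z = tuple(x[k] for k in subset)
        mass.setdefault(z, [0.0, 0.0])[y] += pr
    # In each cell the optimal predictor errs on the less likely label.
    return sum(min(m0, m1) for m0, m1 in mass.values())


def cod(pmf, subset):
    """CoD of Eq. (6), with the convention 0/0 = 1."""
    eps_y = optimal_error(pmf, ())  # empty subset: no observations
    if eps_y == 0:
        return 1.0
    return (eps_y - optimal_error(pmf, subset)) / eps_y


def imp_score(pmf, d):
    """IMP score of Eq. (9): full-set CoD minus best nonempty proper-subset CoD."""
    proper = [s for r in range(1, d) for s in combinations(range(d), r)]
    return cod(pmf, tuple(range(d))) - max(cod(pmf, s) for s in proper)


score = imp_score(and_model_pmf(p=0.8, P1=0.4, P2=0.5, gamma=0.005), d=2)
```

The enumeration over proper subsets grows exponentially in $d$, which is acceptable for the two-, three- and four-input models considered here.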
A. Estimation of the IMP Score
In the CoD estimation problem [6], we defined “model-free” CoD estimators based on resubstitution, leave-one-out, 10-repeated 2-fold cross-validation and .632 bootstrap error estimators. Likewise, we introduce here the corresponding IMP score estimators: the resubstitution IMP score estimator ($\widehat{\mathrm{IMP}}_r$), the leave-one-out IMP score estimator ($\widehat{\mathrm{IMP}}_l$), the 10-repeated 2-fold cross-validation IMP score estimator ($\widehat{\mathrm{IMP}}_{cv}$) and the .632 bootstrap IMP score estimator ($\widehat{\mathrm{IMP}}_{b632}$), given by:

$$\widehat{\mathrm{IMP}}_Y(X) = \widehat{\mathrm{CoD}}_Y(X) - \max_{Z \subsetneq X} \widehat{\mathrm{CoD}}_Y(Z), \tag{11}$$

where $\widehat{\mathrm{CoD}}$ is one of the four CoD estimators and $\widehat{\mathrm{IMP}}$ is the corresponding IMP score estimator.
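As a rough sketch of Eq. (11) for the resubstitution case, the code below assumes that the underlying classifier is the discrete histogram rule (majority vote in each predictor cell), so that the resubstitution error is the fraction of sample points carrying the minority label in their cell. This choice of rule, and the function names, are assumptions of the sketch rather than details stated in this paper.

```python
from collections import Counter
from itertools import combinations


def resub_error(X, Y, subset):
    """Resubstitution error of the majority-vote rule using predictors in `subset`."""
    counts = Counter((tuple(x[k] for k in subset), y) for x, y in zip(X, Y))
    cells = {z for z, _ in counts}
    # Counter returns 0 for (cell, label) pairs that never occur.
    return sum(min(counts[(z, 0)], counts[(z, 1)]) for z in cells) / len(Y)


def resub_cod(X, Y, subset):
    """Resubstitution CoD estimate, with the convention 0/0 = 1."""
    eps_y = resub_error(X, Y, ())  # empty subset: predict the majority label
    if eps_y == 0:
        return 1.0
    return (eps_y - resub_error(X, Y, subset)) / eps_y


def resub_imp(X, Y):
    """Resubstitution IMP score estimate, Eq. (11); assumes at least two predictors."""
    d = len(X[0])
    proper = [s for r in range(1, d) for s in combinations(range(d), r)]
    return resub_cod(X, Y, tuple(range(d))) - max(resub_cod(X, Y, s) for s in proper)


# Usage on a made-up XOR-like sample, where neither predictor alone is informative.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 1, 1, 0]
score = resub_imp(X, Y)
```

In this toy sample the full predictor set classifies every point correctly while each single predictor does no better than guessing, so the estimated IMP score is 1.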
If prior knowledge about the distribution of $(X, Y)$ is available, in the form of the stochastic logic model of Section II, one can obtain a Maximum-Likelihood (ML) IMP score estimator ($\widehat{\mathrm{IMP}}_{ML}$) as a function of ML CoD estimators [5]. Notice that, taking the two-input logic model as an example, the true IMP in (9) is a function of the model parameters $p$, $P_1$, $P_2$ and $\gamma$: $\mathrm{IMP} = g(p, P_1, P_2, \gamma)$. By the principle of ML invariance [7], to obtain the ML IMP score estimator one plugs the ML estimators of the model parameters into $g$. For example, by combining (3), (4), (6) and (10), we obtain the true IMP in the two-input AND model (for the sake of simplicity, we will omit from this point on the explicit reference to $(X, Y)$ in the IMP notation):
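$$\mathrm{IMP} = 1 - \frac{F(p)}{F[A]} - \max\left\{ 1 - \frac{F(p)(1 - P_1) + F[B] P_1}{F[A]}, \; 1 - \frac{F(p)(1 - P_2) + F[C] P_2}{F[A]} \right\} \tag{12}$$

where $A = P_1 P_2 + \gamma + (1 - 2 P_1 P_2 - 2\gamma) p$, $B = \left( (P_1 - P_1 P_2 - \gamma) + (2 P_1 P_2 + 2\gamma - P_1) p \right) / P_1$, $C = \left( (P_2 - P_1 P_2 - \gamma) + (2 P_1 P_2 + 2\gamma - P_2) p \right) / P_2$ and $F(x) = \min(x, 1 - x)$, for $0 \leq x \leq 1$. Hence, the corresponding ML IMP score estimator for this logic model is

$$\widehat{\mathrm{IMP}}_{ML} = 1 - \frac{F(\hat{p})}{F[\hat{A}]} - \max\left\{ 1 - \frac{F(\hat{p})(1 - \hat{P}_1) + F[\hat{B}] \hat{P}_1}{F[\hat{A}]}, \; 1 - \frac{F(\hat{p})(1 - \hat{P}_2) + F[\hat{C}] \hat{P}_2}{F[\hat{A}]} \right\}, \tag{13}$$

where $\hat{A}$, $\hat{B}$ and $\hat{C}$ are obtained by replacing $p$, $P_1$, $P_2$, $\gamma$ with $\hat{p}$, $\hat{P}_1$, $\hat{P}_2$, $\hat{\gamma}$ in the formulations of $A$, $B$ and $C$, respectively. The ML IMP score estimator for the three-input or four-input logic model can be derived in a similar fashion.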
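The closed form in Eq. (12) is straightforward to transcribe. The sketch below is illustrative only: the function name `imp_and2` is hypothetical, and the code assumes $P_1 > 0$, $P_2 > 0$ and $F[A] > 0$ so that no division by zero occurs. By ML invariance, calling the same function with the ML parameter estimates gives the estimator of Eq. (13).

```python
def F(q):
    """F(x) = min(x, 1 - x) for 0 <= x <= 1."""
    return min(q, 1.0 - q)


def imp_and2(p, P1, P2, gamma):
    """IMP score of the two-input stochastic AND model, Eq. (12).

    Assumes P1 > 0, P2 > 0 and F(A) > 0 (assumptions of this sketch).
    Plugging in ML parameter estimates yields the estimator of Eq. (13).
    """
    A = P1 * P2 + gamma + (1 - 2 * P1 * P2 - 2 * gamma) * p
    B = ((P1 - P1 * P2 - gamma) + (2 * P1 * P2 + 2 * gamma - P1) * p) / P1
    C = ((P2 - P1 * P2 - gamma) + (2 * P1 * P2 + 2 * gamma - P2) * p) / P2
    cod_12 = 1 - F(p) / F(A)
    cod_1 = 1 - (F(p) * (1 - P1) + F(B) * P1) / F(A)
    cod_2 = 1 - (F(p) * (1 - P2) + F(C) * P2) / F(A)
    return cod_12 - max(cod_1, cod_2)


# Two-input parameter values from the caption of Fig. 1, at predictive power 0.8.
imp_true = imp_and2(p=0.8, P1=0.4, P2=0.5, gamma=0.005)
```

With these values, $F[A] = 0.323$ and the two single-predictor errors are $0.317$ and $0.323$, so the score works out to $0.117 / 0.323$.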
B. Performance of IMP Score Estimators
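Regarding the performance of an IMP score estimator $\widehat{\mathrm{IMP}}$, the quantities of interest are the bias, variance, and RMS, given by $\mathrm{Bias}[\widehat{\mathrm{IMP}}] = E[\widehat{\mathrm{IMP}}] - \mathrm{IMP}$, $\mathrm{Var}[\widehat{\mathrm{IMP}}]$, and $\mathrm{RMS}[\widehat{\mathrm{IMP}}] = \sqrt{\mathrm{Bias}[\widehat{\mathrm{IMP}}]^2 + \mathrm{Var}[\widehat{\mathrm{IMP}}]}$, respectively. A good IMP score estimator will display small values for all these metrics.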
We employ a Monte-Carlo sampling approach [8] to approx-
imate the bias, variance and RMS of IMP score estimators.
Assuming one specific logic model with known parameter
values, we draw 5000 i.i.d. Monte-Carlo samples from the
joint probability distribution given by the product of eq. (1)
and eq. (2). For each sample data set, we calculate the ML,
resubstitution, leave-one-out, cross-validation, and bootstrap
IMP score estimates, respectively. Then, we obtain the mean,
variance, and RMS of the corresponding IMP score estimators.
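The procedure above can be sketched for the ML estimator in the two-input AND model as follows. This is a simplified illustration, not the authors' code: it covers only the ML estimator, the function names and the fixed seed are assumptions, and the closed form is rearranged to avoid dividing by $\hat{P}_1$ or $\hat{P}_2$, using $F[B] P_1 = \min(B P_1, P_1 - B P_1)$ and the $0/0 = 1$ convention when $F[\hat{A}] = 0$.

```python
import random
from math import sqrt


def F(q):
    return min(q, 1.0 - q)


def imp_and2(p, P1, P2, gamma):
    """Eq. (12), rearranged so that P1 = 0 or P2 = 0 causes no division by zero."""
    P11 = P1 * P2 + gamma
    eps_y = F(P11 + (1 - 2 * P11) * p)          # F[A]
    if eps_y == 0:
        return 0.0                              # all CoDs equal 1 under 0/0 = 1
    b = (P1 - P11) + (2 * P11 - P1) * p         # B * P1
    c = (P2 - P11) + (2 * P11 - P2) * p         # C * P2
    eps_1 = F(p) * (1 - P1) + min(b, P1 - b)    # F(p)(1 - P1) + F[B] P1
    eps_2 = F(p) * (1 - P2) + min(c, P2 - c)
    return (1 - F(p) / eps_y) - max(1 - eps_1 / eps_y, 1 - eps_2 / eps_y)


def sample_and2(n, p, P1, P2, gamma, rng):
    """Draw n i.i.d. points (x1, x2, y) from Eqs. (3) and (4) with an AND gate."""
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    weights = [(1 - P1) * (1 - P2) + gamma, (1 - P1) * P2 - gamma,
               P1 * (1 - P2) - gamma, P1 * P2 + gamma]
    data = []
    for x1, x2 in rng.choices(cells, weights=weights, k=n):
        fx = x1 & x2
        y = fx if rng.random() < p else 1 - fx  # Y agrees with the gate w.p. p
        data.append((x1, x2, y))
    return data


def ml_imp(data):
    """ML IMP score estimate, Eq. (13): plug sample averages into Eq. (12)."""
    n = len(data)
    p = sum((x1 & x2) == y for x1, x2, y in data) / n
    P1 = sum(x1 for x1, _, _ in data) / n
    P2 = sum(x2 for _, x2, _ in data) / n
    gamma = sum(x1 * x2 for x1, x2, _ in data) / n - P1 * P2
    return imp_and2(p, P1, P2, gamma)


def mc_metrics(p, P1, P2, gamma, n=60, reps=5000, seed=0):
    """Monte Carlo approximation of bias, variance and RMS of the ML estimator."""
    rng = random.Random(seed)
    true_imp = imp_and2(p, P1, P2, gamma)
    est = [ml_imp(sample_and2(n, p, P1, P2, gamma, rng)) for _ in range(reps)]
    mean = sum(est) / reps
    var = sum((e - mean) ** 2 for e in est) / reps
    bias = mean - true_imp
    return bias, var, sqrt(bias ** 2 + var)
```

Calling `mc_metrics(0.8, 0.4, 0.5, 0.005)` corresponds to a single point on the two-input curves, with `n = 60` and 5000 repetitions as in the text; sweeping `p` over a grid in [0.5, 1] would trace out a full curve.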
IV. NUMERICAL EXPERIMENTS
Assuming a stochastic AND model, we plot the approximate
performance metrics of the ML IMP score estimator as a
function of predictive power in the two-input, three-input, and
four-input cases. We also compare these with the approximate
performance metrics for resubstitution, leave-one-out, cross-
validation and bootstrap IMP score estimators, as shown in
Figure 1.
Figure 1 shows that, while no estimator emerges as clearly superior in bias, the ML IMP score estimator is clearly the least variable estimator, whereas leave-one-out is generally the most variable one. Most importantly, the RMS column shows that the ML IMP score estimator outperforms all others. Among the model-free estimators, resubstitution is clearly the superior choice in RMS. Notice also that, as the number of inputs (m) in the logic model increases, the margin by which the ML IMP score estimator beats the others in RMS grows, since the complexity of estimation increases with m.
V. CONCLUSION
In this paper, we introduced the estimation problem for
the intrinsically multivariate prediction (IMP) score. We pro-
posed resubstitution, leave-one-out, cross-validation and boot-
strap IMP score estimators. Furthermore, we developed the
maximum-likelihood estimator for the IMP score under a
stochastic many-input logic model. Assuming specific stochastic
AND models, we compared their performance metrics
via Monte-Carlo sampling. We conclude from our results
that the ML IMP score estimator is the estimator of choice,
whereas resubstitution is to be preferred in the absence of prior
knowledge.
Martins et al. [3] applied the ML CoD estimator under a two-input stochastic logic model to a real melanoma data set, and concluded that the IMP criterion could be applied as a practical tool for the identification of critical canalizing genes in real gene expression data. Our main goal in this paper was to validate the performance of the ML IMP score estimator as compared with resubstitution, leave-one-out, cross-validation and bootstrap from a theoretical perspective. Further research will focus on investigating and comparing how these IMP score estimators reveal the multivariate relationships in gene regulatory networks and identify canalizing genes in practice.
It is hoped that the ML IMP score estimator will yield more accurate biological information than the others, given its advantageous performance in our theoretical analysis.

[Figure 1: 3 × 3 grid of plots. Rows: m = 2, m = 3, m = 4. Columns: bias, variance, RMS. Horizontal axis: predictive power (0.5 to 1.0). Curves: ML, resub, loo, bootstrap, cv.]

Fig. 1. Bias, variance, and RMS for several IMP score estimators vs. predictive power in the two-input AND model (fixing P1 = 0.4, P2 = 0.5 and γ = 0.005), three-input AND model (fixing P1 = 0.5, P2 = 0.6, P3 = 0.65, γ(1, 2) = 0.01, γ(1, 3) = 0.025, γ(2, 3) = 0.035, and γ(1, 2, 3) = 0.02) and four-input AND model (fixing P1 = 0.5, P2 = 0.6, P3 = 0.65, γ(1, 2) = 0.01, γ(1, 3) = 0.025, γ(2, 3) = 0.035, and γ(1, 2, 3) = 0.02), respectively, assuming sample size n = 60. All curves are obtained via Monte-Carlo sampling.
REFERENCES
[1] C.H. Waddington, Canalization of development and the inheritance of acquired characters, Nature (1942) 563–565.
[2] E.R. Dougherty, S. Kim, Y.D. Chen, Coefficient of determination in nonlinear signal processing, Signal Processing 80 (2000) 2219–2235.
[3] D.C. Martins, U.M. Braga-Neto, R.F. Hashimoto, M.L. Bittner, E.R. Dougherty, Intrinsically multivariate predictive genes, IEEE Journal of Selected Topics in Signal Processing 2 (3) (2008) 424–439.
[4] I. Shmulevich, E.R. Dougherty, S. Kim, W. Zhang, Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks, Bioinformatics 18 (2) (2002) 261–274.
[5] T. Chen, U.M. Braga-Neto, Maximum Likelihood Estimation of the Binary Coefficient of Determination, Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, November 2011.
[6] T. Chen, U.M. Braga-Neto, Exact performance of CoD estimators in discrete prediction, EURASIP Journal of Advances in Signal Processing: Special Issue on Genomic Signal Processing (2010).
[7] G. Casella, R.L. Berger, Statistical Inference, Duxbury Press, 2002.
[8] C.P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer, New York, 1999.