Biomedical Signal Processing and Control 13 (2014) 337–344
Contents lists available at ScienceDirect
Biomedical Signal Processing and Control
journal homepage: www.elsevier.com/locate/bspc
Technical Note
Confidence masks for genome DNA copy number variations in applications to HR-CGH array measurements
Jorge Munoz-Minjares, Jesús Cabal-Aragón, Yuriy S. Shmaliy ∗
Universidad de Guanajuato, Department of Electronics Engineering, Salamanca, 36885 Gto., Mexico
a r t i c l e i n f o
Article history: Received 21 November 2013; Received in revised form 21 May 2014; Accepted 19 June 2014
Keywords: Genome copy number variations; Confidence limit masks; HR-CGH microarray
a b s t r a c t
The array-comparative genomic hybridization (aCGH) and next generation sequence technologies enable cost-efficient high resolution detection of DNA copy number variations (CNVs). However, while the CNVs estimates provided by different methods are often inconsistent with each other, little can be found about the estimation errors. Based on our recent studies of the confidence limits for stepwise signals measured in noise, we develop an efficient algorithm for computing the confidence upper and lower boundary masks in order to guarantee the existence of genomic changes with a required probability. We suggest combining these masks with estimates in order to give medical experts more information about true CNVs structures. Applications to high-resolution CGH microarray measurements show that there is a probability that some changes predicted by an estimator may not exist.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
The deoxyribonucleic acid (DNA) of a genome is commonly recognized to be essential for human life. DNA is usually double-stranded, therefore the size of a gene or chromosome is often measured in base pairs (bp). A common unit of measurement is the kilobase (kb), equal to 1000 bp of DNA [1]. DNA can demonstrate structural changes called copy number variations (CNVs), which are associated with diseases such as cancer [2]. A CNV can be defined as a DNA segment of 1 kb or larger that is present at a variable copy number in comparison with a reference genome [3]. The human genome with 23 chromosomes is estimated to be about 3.2 billion base pairs long and to contain 20,000–25,000 distinct genes [2]. It is known [4] that each CNV may range from about 1 kb to several megabases (Mb) in size. To detect the CNVs at a resolution level of 10–25 kb [5], the array-comparative genomic hybridization (aCGH) technique was developed, employing chromosomal microarray analysis. Although it was reported that high-resolution CGH (HR-CGH) arrays can accurately detect structural variations at a resolution of 5 kb [6] and even 200 bp [7], their regular resolution is still insufficient to detect short genomic changes. Progress was achieved in the last few years following the emergence of the next generation sequence (NGS) technologies, which allow for the detection of CNVs with a resolution < 10 kbp [8]. The NGS approach has generated extensive developments of CNVs detection methods [9–12], and several such methods obtaining a resolution of 0.8–6 kb were recently reviewed in [13].
∗ Corresponding author. Tel.: +52 464 647 01 95; fax: +52 464 647 24 00. E-mail addresses: [email protected], [email protected] (Y.S. Shmaliy).
http://dx.doi.org/10.1016/j.bspc.2014.06.006
1746-8094/© 2014 Elsevier Ltd. All rights reserved.
Even though the conceptual steps in aCGH and CNV-seq methods are different, the outputs are typically represented in the same scales [9]. The genomic location is often given by the $n$th probe, $n \in [1, M]$, where $M$ is the number of probes following with a unit step, ignoring “bad” or empty probes. The $n_l$th discrete point corresponds to the $i_l$th edge (breakpoint) in the genomic location scale in kb or Mb, which is finally used to represent the CNVs. The breakpoints are placed as $0 < n_1 < \cdots < n_L < M$, where $n_l$, $l \in [1, L]$, is the $l$th breakpoint and $L$ is the number of the breakpoints. The CNVs are represented with $L + 1$ segmental constant changes $a_j$, $j \in [1, L + 1]$, each characterizing a segment between $i_{j-1}$ and $i_j$ on an interval $[i_{j-1}, i_j - 1]$. Typically, the CNVs are normalized as $\log_2 R/G = \log_2$ Ratio, where R and G are the fluorescent Red and Green intensities, respectively [14].
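As an illustration of this representation, the following minimal Python sketch builds a noise-free stepwise profile from breakpoints and segmental levels and evaluates a log2 ratio. The function names and the zero-based probe indexing are assumptions made for the example; they are not taken from the cited methods.

```python
import math


def log2_ratio(red, green):
    """Return log2(R/G) for positive Red and Green intensities."""
    return math.log2(red / green)


def stepwise_profile(breakpoints, levels, num_probes):
    """Build a piecewise-constant profile of length num_probes.

    breakpoints: increasing probe indices n_1 < ... < n_L (zero-based here,
                 an assumption of this sketch).
    levels:      L + 1 segmental levels a_1, ..., a_{L+1}.
    Segment j covers probes n_{j-1} .. n_j - 1.
    """
    if len(levels) != len(breakpoints) + 1:
        raise ValueError("need exactly one more level than breakpoints")
    edges = [0] + list(breakpoints) + [num_probes]
    profile = []
    for j, level in enumerate(levels):
        profile.extend([level] * (edges[j + 1] - edges[j]))
    return profile
```

For example, two breakpoints at probes 3 and 5 with three levels give a profile with segments of 3, 2, and 3 probes when `num_probes` is 8.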
Genome CNVs are stepwise sparse with a limited number of breakpoints [15]. The detected structure is usually contaminated by intensive noise [16]. In the log2 Ratio scale, noise is commonly modeled as white Gaussian with equal or different segmental variances. Various statistical approaches have been developed in order to estimate CNVs in array CGH and NGS data. Two main goals are [17]: (1) to infer the number and statistical significance of the alterations and (2) to locate their boundaries accurately. A number of statistical methods were tested by CNVs measurements as observed in [18], including wavelet-based, robust, adaptive kernel smoothers, maximum likelihood (ML), penalized bridge estimator and ridge regression, fused least-absolute shrinkage and selection operator (Lasso), the Schwarz information criterion-based estimator, and forward-backward smoothers. Quite comprehensive reviews of the most common algorithms were given in [17,13].
It has to be remarked that, in view of large detection noise, no estimator, even an ideal one, is able to provide a clear picture of genomic changes [17]: the estimates are often accompanied by large segmental errors and jitter in the breakpoints. Thus, medical experts may have insufficient information for a correct decision about the true CNVs structure [15]. Even so, little can be found in the literature about the estimation errors, despite the fact that the existence of large jitter in the CNV breakpoints has been shown experimentally in [19]. The problem is complicated by the fact that the exact jitter distribution is still unknown for such signals even in white Gaussian noise. Just recently, in [18,20], we derived an approximate jitter distribution and showed that it obeys the discrete skew Laplace law.
In this paper, we introduce a statistical framework and develop an efficient algorithm for computing the confidence lower boundary (LB) and upper boundary (UB) masks for CNVs. The proposed masks can be applied to measurements conducted by any technology, although we give applications only to high-resolution CGH (HR-CGH) microarray data available from [21]. The rest of the paper is organized as follows. In Section 2, we consider a statistical model of genomic changes. A computational algorithm for the confidence UB and LB masks is developed in Section 3. Testing of some HR-CGH microarray-based CNVs measurements by the confidence masks is given in Section 4. Discussions of segmental errors and jitter are provided in Section 5. Finally, concluding remarks can be found in Section 6.
Fig. 1. Genomic changes with a single breakpoint at $n_l$: (a) log2 Ratio, (b) segmental Gaussian distributions with different variances, and (c) skew Laplace jitter distribution in the breakpoint at $n_l$.
Table 1
Probability measures for genomic changes.
Sense | $\vartheta$ | $P$ (%) | $\varkappa$ (%)
Even chances | 0.6745 | 50 | 50
1-Sigma | 1 | 68.27 | 31.73
Probable | 1.15035 | 75 | 25
Almost certain | 1.81191 | 93 | 7
Typical confidence | 1.96 | 95 | 5
2-Sigma | 2 | 95.45 | 4.55
3-Sigma | 3 | 99.73 | 0.27
Certain | ∞ | 100 | 0
2. Confidence masks for genomic changes
Estimation error bounds for genome DNA CNVs can be found if the noise in segments and the jitter in the breakpoints are specified statistically [22]. Based upon the statistical properties of segmental noise and jitter in the breakpoints, the confidence UB and LB masks can be formalized for CNVs to exist with some probability. The masks, in turn, can serve as additional measures for experts to make decisions about possible genomic changes.
2.1. Statistical modeling of detected genomic changes
A typical genomic change detected around the $l$th breakpoint at $n_l$ can be illustrated as shown in Fig. 1a. One recognizes here two segments at levels $a_l$ and $a_{l+1}$. In the log2 Ratio scale, segmental noise is often supposed to be white [23] and modeled with Gaussian density [24]
$$p_l(x) = \frac{1}{\sqrt{2\pi\sigma_l^2}} \exp\left[-\frac{(x - a_l)^2}{2\sigma_l^2}\right] \qquad (1)$$
in which the segmental noise variances $\sigma_l^2$ along the genome can be different [15], as shown in Fig. 1b. In the presence of segmental noise, the breakpoint location cannot be predicted with unit probability. An estimator may find it several points to the left or to the right of the true location; this deviation is the jitter. We have shown in [18] that the jitter distribution measured in [19] can approximately be described with the discrete skew Laplace law [25] (Fig. 1c)
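$$p(k \mid d_l, q_l) = \frac{(1 - d_l)(1 - q_l)}{1 - d_l q_l} \begin{cases} d_l^{k}, & k \geq 0, \\ q_l^{|k|}, & k \leq 0, \end{cases} \qquad (2)$$
where $d_l = e^{-\kappa_l/\nu_l} \in (0, 1)$, $q_l = e^{-1/(\kappa_l \nu_l)} \in (0, 1)$, $\kappa_l = \sqrt{\ln x_l / \ln(x_l/\mu_l)}$, and $\nu_l = -\kappa_l/\ln x_l > 0$. Auxiliary functions connected with the probabilities for each detected point to belong to one segment or another are specified as
$$x_l = \frac{\phi_l(1 + \mu_l)}{2(1 + \phi_l)} \left(1 - \sqrt{1 + \frac{4\mu_l(1 - \phi_l^2)}{\phi_l^2(1 + \mu_l)^2}}\right), \qquad (3)$$
$$\mu_l = \frac{P_l^A(1 - P_l^B)}{P_l^B(1 - P_l^A)}, \qquad (4)$$
$$\phi_l = \frac{P_l^A + P_l^B - 1}{(1 - 2P_l^A)(1 - 2P_l^B)}, \qquad (5)$$
$$P_l^A = \begin{cases} 1 + \frac{1}{2}\left[\mathrm{erf}(g_l^\beta) - \mathrm{erf}(g_l^\alpha)\right], & \gamma_l^- < \gamma_l^+, \\ \frac{1}{2}\,\mathrm{erfc}(g_l^\alpha), & \gamma_l^- = \gamma_l^+, \\ \frac{1}{2}\left[\mathrm{erf}(g_l^\beta) - \mathrm{erf}(g_l^\alpha)\right], & \gamma_l^- > \gamma_l^+, \end{cases} \qquad (6)$$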
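Table 2
Algorithm for computing the confidence UB mask $B_n^U$ and LB mask $B_n^L$ via detected CNVs $y_n$ and given breakpoint locations $\hat n_l$ for the bound wideness $\vartheta$.
Input: $y_n$, $\hat n_l$, $\vartheta$, $L$, $M$
1: $N_{L+1} = M - \hat n_L$, $\hat n_0 = 0$, $\varkappa = \mathrm{erfc}(\vartheta/\sqrt{2})$
2: for $j = 1 : L + 1$ do
3:   $N_j = \hat n_j - \hat n_{j-1}$, $\hat a_j = \frac{1}{N_j}\sum_{v=\hat n_{j-1}}^{\hat n_j - 1} y_v$
4:   $\hat\sigma_j = \sqrt{\frac{1}{N_j}\sum_{v=\hat n_{j-1}}^{\hat n_j - 1} (y_v - \hat a_j)^2}$
5: end for
6: for $l = 1 : L$ do
7:   $\hat\Delta_l = \hat a_{l+1} - \hat a_l$, $\hat\gamma_l^- = \hat\Delta_l^2/\hat\sigma_l^2$, $\hat\gamma_l^+ = \hat\Delta_l^2/\hat\sigma_{l+1}^2$
8:   $\alpha_l$ by (8) with “−” and $a_l = \hat a_l$
9:   $\beta_l$ by (8) with “+” and $a_l = \hat a_l$
10:   $g_l^\beta = \frac{\beta_l - \hat a_l}{|\hat\Delta_l|}\sqrt{\hat\gamma_l^-/2}$, $g_l^\alpha = \frac{\alpha_l - \hat a_l}{|\hat\Delta_l|}\sqrt{\hat\gamma_l^-/2}$
11:   $h_l^\beta = \frac{\beta_l - \hat a_l}{|\hat\Delta_l|}\sqrt{\hat\gamma_l^+/2}$, $h_l^\alpha = \frac{\alpha_l - \hat a_l}{|\hat\Delta_l|}\sqrt{\hat\gamma_l^+/2}$
12:   $P_l^A$ by (6), $P_l^B$ by (7), $\phi_l$ by (5)
13:   $\mu_l = \frac{P_l^A(1 - P_l^B)}{P_l^B(1 - P_l^A)}$, $x_l$ by (3), $\kappa_l = \sqrt{\ln x_l / \ln(x_l/\mu_l)}$
14:   $\nu_l = -\kappa_l/\ln x_l$, $d_l = e^{-\kappa_l/\nu_l}$, $q_l = e^{-1/(\kappa_l\nu_l)}$
15:   $k_l^R$ by (14)   ▷ right jitter
16:   $k_l^L$ by (15)   ▷ left jitter
17: end for
18: $C_{L+1} = M - 1$, $D_{L+1} = M - 1$
19: for $l = 1 : L$ do
20:   $C_l = \hat n_l - k_l^R$ if $\hat\Delta_l > 0$; $C_l = \hat n_l + k_l^L$ if $\hat\Delta_l < 0$
21:   $D_l = \hat n_l + k_l^L$ if $\hat\Delta_l > 0$; $D_l = \hat n_l - k_l^R$ if $\hat\Delta_l < 0$
22: end for
23: for $l = 1 : L$ do
24:   $C_l = C_l$ if $\mathrm{Im}\, C_l = 0$; $C_l = C_{l-1}$ if $\hat\Delta_l \geq 0$ and $\mathrm{Im}\, C_l \neq 0$; $C_l = C_{l+1}$ if $\hat\Delta_l < 0$ and $\mathrm{Im}\, C_l \neq 0$
25:   $D_l = D_l$ if $\mathrm{Im}\, D_l = 0$; $D_l = D_{l+1}$ if $\hat\Delta_l \geq 0$ and $\mathrm{Im}\, D_l \neq 0$; $D_l = D_{l-1}$ if $\hat\Delta_l < 0$ and $\mathrm{Im}\, D_l \neq 0$
26: end for
27: $l = 1$, $k = 1$
28: for $n = 0 : M - 1$ do
29:   $l = l$ if $n < C_l$; $l = l + 1$ if $n \geq C_l$ and $C_{l+1} > C_l$; $l = l + 2$ if $n \geq C_l$ and $C_{l+1} \leq C_l$
30:   $k = k$ if $n < D_l$; $k = k + 1$ if $n \geq D_l$ and $D_{l+1} > C_l$; $k = k + 2$ if $n \geq D_l$ and $D_{l+1} \leq C_l$
31:   $B_n^U = \hat a_l + \vartheta\sqrt{\hat\sigma_l^2/N_l}$   ▷ UB mask
32:   $B_n^L = \hat a_k - \vartheta\sqrt{\hat\sigma_k^2/N_k}$   ▷ LB mask
33: end for
Output: $B_n^U$, $B_n^L$
Fig. 2. Measurements and estimates of a part of the 7th chromosome taken from the archive “159A–vs–159D–cut” of ROMA.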
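$$P_l^B = \begin{cases} \frac{1}{2}\left[\mathrm{erf}(h_l^\alpha) - \mathrm{erf}(h_l^\beta)\right], & \gamma_l^- < \gamma_l^+, \\ 1 - \frac{1}{2}\,\mathrm{erfc}(h_l^\alpha), & \gamma_l^- = \gamma_l^+, \\ 1 + \frac{1}{2}\left[\mathrm{erf}(h_l^\alpha) - \mathrm{erf}(h_l^\beta)\right], & \gamma_l^- > \gamma_l^+, \end{cases} \qquad (7)$$
where $g_l^\beta = \frac{\beta_l - \Delta_l}{|\Delta_l|}\sqrt{\gamma_l^-/2}$, $g_l^\alpha = \frac{\alpha_l - \Delta_l}{|\Delta_l|}\sqrt{\gamma_l^-/2}$, $h_l^\beta = \frac{\beta_l}{|\Delta_l|}\sqrt{\gamma_l^+/2}$, $h_l^\alpha = \frac{\alpha_l}{|\Delta_l|}\sqrt{\gamma_l^+/2}$, erf(x) is the error function, erfc(x) is the complementary error function, and
$$\alpha_l, \beta_l = \frac{a_l\gamma_l^- - a_{l+1}\gamma_l^+}{\gamma_l^- - \gamma_l^+} \mp \frac{1}{\gamma_l^- - \gamma_l^+} \sqrt{(a_l - a_{l+1})^2\gamma_l^-\gamma_l^+ + 2\Delta_l^2(\gamma_l^- - \gamma_l^+)\ln\sqrt{\frac{\gamma_l^-}{\gamma_l^+}}} \qquad (8)$$
if $\gamma_l^- \neq \gamma_l^+$. For $\gamma_l^- = \gamma_l^+$, set $\alpha_l = \Delta_l/2$ and $\beta_l = \pm\infty$.
The signal-to-noise ratios (SNRs) in the $l$th and $(l + 1)$th segments are specified as, respectively,
$$\gamma_l^- = \frac{\Delta_l^2}{\sigma_l^2}, \qquad \gamma_l^+ = \frac{\Delta_l^2}{\sigma_{l+1}^2}, \qquad (9)$$
where the segmental change is $\Delta_l = a_{l+1} - a_l$.
Distributions (1) and (2) suggest that, due to segmental noise (Fig. 1a), there is always a probability (often high) of segmental errors (often large) and jitter in the breakpoints (sometimes essential), irrespective of the estimator used. Of prime interest is thus how large these errors are and which confidence limit masks guarantee the existence of genomic changes with sufficient probability.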
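To make the jitter model (2) concrete, a minimal Python sketch of the discrete skew Laplace probability mass function is given below. The function names are hypothetical, and the parameters d and q are passed in directly rather than being derived from (3)–(8), so this illustrates only the shape and normalization of (2).

```python
import math


def skew_laplace_params(kappa, nu):
    """Map (kappa, nu) to d = exp(-kappa/nu) and q = exp(-1/(kappa*nu))."""
    d = math.exp(-kappa / nu)
    q = math.exp(-1.0 / (kappa * nu))
    return d, q


def skew_laplace_pmf(k, d, q):
    """Discrete skew Laplace probability of an integer jitter k, as in (2)."""
    if not (0.0 < d < 1.0 and 0.0 < q < 1.0):
        raise ValueError("d and q must lie strictly between 0 and 1")
    norm = (1.0 - d) * (1.0 - q) / (1.0 - d * q)
    # Geometric decay with ratio d for k >= 0 and with ratio q for k < 0.
    return norm * (d ** k if k >= 0 else q ** (-k))
```

Summing the values over a sufficiently wide range of k returns a total close to one, which is a quick check that the normalizing factor in (2) has been typed correctly.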
2.2. Confidence UB and LB masks
Let us suppose that the estimate $\hat n_l$ of the $l$th breakpoint location is available. At least, it can be assigned visually. In view of the white nature of noise, simple averaging applied on an interval of $N_l = \hat n_l - \hat n_{l-1}$ points from $\hat n_{l-1}$ to $\hat n_l - 1$ gives the best estimate for the $l$th segmental level, $\hat a_l = \frac{1}{N_l}\sum_{v=\hat n_{l-1}}^{\hat n_l - 1} y_v$, whose mean value is $E\{\hat a_l\} = a_l$ and whose variance is $\hat\sigma_l^2 = \sigma_l^2/N_l$. Because $\hat\sigma_l^2$ is commonly not negligible, segmental errors occur. The confidence UB and LB for segmental estimates can thus be specified in the $\vartheta$-sigma sense as
$$\hat a_l^{UB} \cong \hat a_l + \varepsilon = \hat a_l + \vartheta\sqrt{\frac{\sigma_l^2}{N_l}} = \hat a_l + \vartheta\hat\sigma_l, \qquad (10)$$
$$\hat a_l^{LB} \cong \hat a_l - \varepsilon = \hat a_l - \vartheta\sqrt{\frac{\sigma_l^2}{N_l}} = \hat a_l - \vartheta\hat\sigma_l, \qquad (11)$$
where $\vartheta$ indicates the bound wideness in terms of $\hat\sigma_l$. The probability $\varkappa$ for the segmental estimate to exceed a threshold $\varepsilon$ strongly depends on the segmental length $N_l$ and can be determined, using (1) and $\hat a_l$, as $\varkappa(N_l) = 2\int_{a_l + \varepsilon}^{\infty} p_l(x)\,dx = \mathrm{erfc}(\varsigma_l\sqrt{N_l/2})$, where erfc(x) is the complementary error function and $\varsigma_l = \varepsilon/\sqrt{\sigma_l^2}$ is the normalized threshold. A distinctive feature of $\varkappa$ is that it does not depend on the unknown $a_l$. By combining $\varepsilon$ and $\varsigma_l$ in $\varkappa(N_l)$ we get $\varkappa(N_l) = \mathrm{erfc}(\vartheta/\sqrt{2})$, and the confidence probability for the segmental estimate becomes $P(N_l) = 1 - \varkappa(N_l) = 1 - \mathrm{erfc}(\vartheta/\sqrt{2})$. Table 1 gives several values of $\vartheta$, $P$, and $\varkappa$ for likely existing genomic changes (> 50%). As can be seen, the 1-sigma sense ($\vartheta = 1$) occupies an intermediate position between the 50% probability (even chances) and the 75% probability (probably existing changes). Herewith, the 2-sigma sense ($\vartheta = 2$) can be treated as typical or almost certainly existing changes and 3-sigma ($\vartheta = 3$) as certainly existing changes.
Fig. 3. UB mask and LB mask for the estimates of the CNVs shown in Fig. 2: (a) genomic location from 130 Mb to 146 Mb and (b) genomic location from 146 Mb to 156 Mb. Jitter in $i_1$, $i_6$, $i_7$, $i_9$, $i_{10}$, $i_{12}$, and $i_{13}$ is moderate and these breakpoints are well detectable. The breakpoints $i_2$, $i_3$, $i_4$, $i_5$, $i_8$, $i_9$, and $i_{11}$ cannot be estimated correctly owing to large jitter. There is a probability that the breakpoints $i_2$, $i_3$, $i_4$, $i_5$, and $i_{11}$ do not exist. There is a high probability that the breakpoint $i_5$ does not exist.
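The relations above can be checked numerically with a short Python sketch. It evaluates the exceedance probability erfc(ϑ/√2) and the bounds (10) and (11) for one segment. The function names are hypothetical. The sample numbers are those of the first segment in Table 3, and reading its variance entry 6.302 as 6.302 × 10⁻³ is an assumption of this sketch, made because that scale appears to reproduce the tabulated bounds.

```python
import math


def exceedance_probability(theta):
    """Probability erfc(theta / sqrt(2)) of leaving the theta-sigma bounds."""
    return math.erfc(theta / math.sqrt(2.0))


def segment_bounds(level, noise_variance, num_probes, theta):
    """Return (lower, upper) bounds level -/+ theta * sqrt(variance / N)."""
    eps = theta * math.sqrt(noise_variance / num_probes)
    return level - eps, level + eps


# 3-sigma sense: confidence probability in percent.
confidence_percent = 100.0 * (1.0 - exceedance_probability(3.0))

# First segment of Table 3 (variance scale of 1e-3 is an assumption).
lower, upper = segment_bounds(1.069, 6.302e-3, 602, 3.0)
```

Because the bound half-width shrinks as 1/√N, short segments of only a few probes receive much wider bounds than long ones at the same ϑ.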
By (2), the jitter left boundary (JLB) $J_l^L$ and the jitter right boundary (JRB) $J_l^R$ can be defined with respect to the $l$th breakpoint $\hat n_l$ similarly to (10) and (11) as
$$J_l^L \cong \hat n_l - k_l^R, \qquad (12)$$
$$J_l^R \cong \hat n_l + k_l^L, \qquad (13)$$
where
$$k_l^R = \left\lfloor \frac{\nu_l}{\kappa_l} \ln\frac{(1 - d_l)(1 - q_l)}{\varkappa(1 - d_l q_l)} \right\rfloor, \qquad (14)$$
$$k_l^L = \left\lfloor \nu_l\kappa_l \ln\frac{(1 - d_l)(1 - q_l)}{\varkappa(1 - d_l q_l)} \right\rfloor \qquad (15)$$
correspond to the right (superscript “R”) and left (superscript “L”) jitter and $\lfloor x \rfloor$ means the maximum integer lower than or equal to $x$. Functions (14) and (15) can easily be obtained by equating (2) to $\varkappa$ and solving for $k_l$. Here, we allow equal confidence intervals for segments and breakpoints.
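A minimal Python sketch of (14) and (15) follows. It uses the identities ν/κ = −1/ln d and νκ = −1/ln q, which follow from the definitions of d and q, so that the jitter extents can be computed from d, q, and the probability level alone. The function name is hypothetical and the inputs in the usage note are illustrative values, not data from the paper.

```python
import math


def jitter_extents(d, q, prob):
    """Return (k_right, k_left) by equating the pmf (2) to prob.

    k_right is the largest k >= 0 with norm * d**k >= prob and k_left the
    largest |k| with norm * q**|k| >= prob, provided norm >= prob.
    """
    if not (0.0 < d < 1.0 and 0.0 < q < 1.0):
        raise ValueError("d and q must lie strictly between 0 and 1")
    if not (0.0 < prob < 1.0):
        raise ValueError("prob must lie strictly between 0 and 1")
    norm = (1.0 - d) * (1.0 - q) / (1.0 - d * q)
    log_ratio = math.log(norm / prob)
    k_right = math.floor(log_ratio / -math.log(d))  # (nu/kappa) * ln(...)
    k_left = math.floor(log_ratio / -math.log(q))   # (nu*kappa) * ln(...)
    return k_right, k_left
```

With d = 0.5, q = 0.25 and the 3-sigma level prob = 0.0027, the slower decay on the d side yields the larger extent. If prob exceeds the peak value of the pmf, the logarithm is negative and the floors become negative, which this sketch does not treat specially.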
3. Computational algorithm
The confidence UB mask $B_n^U$ and LB mask $B_n^L$ can now be formed by combining (10)–(13). The relevant algorithm is listed in Table 2. Its inputs are the detected CNVs $y_n$, breakpoint locations $\hat n_l$, bound wideness $\vartheta$, number $L$ of the breakpoints, and number $M$ of the probes. At the output, it has two confidence masks $B_n^U$ and $B_n^L$.
The first algorithmic block (lines 2–5) computes the segmental statistics $\hat a_j$ and $\hat\sigma_j$ on intervals between neighboring breakpoints. The second block (lines 6–17) employs (1)–(5) to compute the right jitter $k_l^R$ and left jitter $k_l^L$. The third block (lines 18–22) finds the jitter boundaries $C_l$ and $D_l$ for the UB and LB masks. The fourth block (lines 23–26) makes corrections to the jitter boundaries in the cases when some boundaries merge or overlap. The fifth block (lines 27–33) skips some points in the case when the UB mask or LB mask occurs to be uniform for several breakpoints. The masks $B_n^U$ and $B_n^L$ finally go to the output. Note that (2) approximates jitter in the breakpoints of CNVs in the lower bound sense. This means that wide jitter boundaries detected by the algorithm may be even wider in practice.
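The first and last blocks of the algorithm can be sketched in a few lines of Python. The sketch below is deliberately simplified: it computes the segmental statistics and then lays flat UB and LB values over each segment, ignoring the jitter boundaries of blocks two to four. The function names and the zero-based indexing are assumptions of the example, and it should not be read as a full implementation of Table 2.

```python
import math


def segment_statistics(y, breakpoints):
    """Return (start, stop, N, mean, variance) for each segment of y.

    Segment j covers samples breakpoints[j-1] .. breakpoints[j] - 1,
    with 0 and len(y) used as the outer edges.
    """
    edges = [0] + list(breakpoints) + [len(y)]
    stats = []
    for start, stop in zip(edges[:-1], edges[1:]):
        segment = y[start:stop]
        n = len(segment)
        mean = sum(segment) / n
        variance = sum((v - mean) ** 2 for v in segment) / n
        stats.append((start, stop, n, mean, variance))
    return stats


def flat_masks(y, breakpoints, theta):
    """Return (upper, lower) masks without jitter widening (simplified)."""
    upper, lower = [], []
    for _start, _stop, n, mean, variance in segment_statistics(y, breakpoints):
        eps = theta * math.sqrt(variance / n)
        upper.extend([mean + eps] * n)
        lower.extend([mean - eps] * n)
    return upper, lower
```

A full version would additionally shift the switching points of the two masks by the left and right jitter around each breakpoint, which is what makes the masks merge across weak breakpoints.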
4. Testing of CNVs measurements by the confidence masks
Our next purpose is to test some detected CNVs by the algorithm developed in Table 2. We provide such a test in the 3-sigma sense, suggesting that CNVs exist between the UB and LB masks with a high probability of P = 99.73%. Our studies are based on some HR-CGH array data which are available from the representational oligonucleotide microarray analysis (ROMA) [21]. The breakpoint locations are also taken from [21]. We deliberately select data associated with potentially large jitter and large segmental errors. For clarity, we first compute some characteristics of the detected CNVs and put them in tables. We notice that our segmental estimates found by averaging [22] are in good correspondence with [21].
4.1. Detected data with large jitter
The first database processed is a part of the 7th chromosome in archive “159A–vs–159D–cut” of ROMA. It is shown to have 14 segments and 13 breakpoints (Figs. 2 and 3). Below we shall show that, owing to large detection noise, there is a high probability that some breakpoints do not exist.
Observe Fig. 3a and the characteristics collected in Table 3. Here, the only breakpoint whose location can be estimated with high accuracy is $i_1$. Jitter in $i_6$ and $i_7$ is moderate. All other breakpoints have large jitter. It is seen that the UB mask covering the 2nd-to-6th segments is almost uniform. Thus, there is a probability that the 2nd-to-5th breakpoints do not exist. If one follows the LB mask, then the locations of the 2nd-to-4th breakpoints can be predicted, although with large errors. At least they can be supposed to exist. However, nothing definitive can be said about the 5th breakpoint, and one may suppose that it does not exist. It is also hard to distinguish the true location of the 8th breakpoint. In Fig. 3b, $i_{10}$, $i_{12}$, and $i_{13}$ are well detectable owing to large segmental SNRs. The breakpoint $i_9$ has a moderate jitter. In turn, the location of $i_{11}$ is unclear. Moreover, there is a probability that $i_{11}$ does not exist.
Table 3
CNVs characteristics for the 7th sample from “159A–vs–159D–cut” (ROMA); an average resolution is r = 30 kb. Statistics, boundaries, and jitter parameters are given for log2 Ratio.
$j$ | Initial point, b | $\hat n_{j-1}$ | $N_j$ | $\hat a_j$ | $\hat\sigma_j^2$ | $\hat a_j^{UB}$ (3-σ) | $\hat a_j^{LB}$ (3-σ) | $\hat\Delta_{j-1}$ | $\hat\gamma_{j-1}^-$ | $\hat\gamma_{j-1}^+$ | $k_{j-1}^L$ (3-σ) | $k_{j-1}^R$ (3-σ)
1 | 113811463 | 3085 | 602 | 1.069 | 6.302 | 1.079 | 1.060 | – | – | – | – | –
2 | i1 = 130613272 | 3687 | 81 | 1.462 | 25.14 | 1.516 | 1.410 | 0.393 | 24.47 | 6.135 | 1 | 2
3 | i2 = 132446924 | 3768 | 8 | 1.273 | 26.79 | 1.446 | 1.099 | −0.190 | 1.430 | 1.342 | 5 | 5
4 | i3 = 132633722 | 3776 | 127 | 1.466 | 27.20 | 1.510 | 1.423 | 0.194 | 1.401 | 1.380 | 5 | 5
5 | i4 = 135934758 | 3903 | 97 | 1.359 | 19.86 | 1.402 | 1.316 | −0.107 | 0.422 | 0.578 | 17 | 4
6 | i5 = 139176548 | 4000 | 67 | 1.433 | 37.85 | 1.504 | 1.361 | 0.074 | 0.273 | 0.143 | – | –
7 | i6 = 140416906 | 4067 | 12 | 1.879 | 58.37 | 2.088 | 1.670 | 0.446 | 5.260 | 3.411 | 2 | 3
8 | i7 = 140895309 | 4079 | 57 | 1.371 | 22.75 | 1.430 | 1.311 | −0.508 | 4.428 | 11.36 | 2 | 2
9 | i8 = 142501212 | 4136 | 123 | 1.481 | 31.77 | 1.530 | 1.433 | 0.111 | 0.538 | 0.385 | 4 | 21
10 | i9 = 146659997 | 4259 | 84 | 1.976 | 188.7 | 2.118 | 1.833 | 0.494 | 7.686 | 1.294 | 2 | 5
11 | i10 = 149656964 | 4343 | 62 | 0.636 | 29.40 | 0.701 | 0.570 | −1.340 | 9.512 | 61.05 | 1 | 1
12 | i11 = 151782674 | 4405 | 16 | 0.799 | 69.63 | 0.997 | 0.601 | 0.163 | 0.902 | 0.381 | 2 | 38
13 | i12 = 152348882 | 4421 | 80 | 1.999 | 156.2 | 2.131 | 1.866 | 1.200 | 20.68 | 9.219 | 1 | 1
14 | i13 = 155059985 | 4501 | 48 | 0.742 | 31.54 | 0.819 | 0.665 | −1.257 | 10.11 | 50.06 | 1 | 1
Table 4
CNVs characteristics consistent with Table 3 for the 2nd sample from “159A–vs–159D–cut” (ROMA).
$j$ | Initial point, b | $\hat n_{j-1}$ | $N_j$ | $\hat a_j$ | $\hat\sigma_j^2$ | $\hat a_j^{UB}$ (3-σ) | $\hat a_j^{LB}$ (3-σ) | $\hat\Delta_{j-1}$ | $\hat\gamma_{j-1}^-$ | $\hat\gamma_{j-1}^+$ | $k_{j-1}^L$ (3-σ) | $k_{j-1}^R$ (3-σ)
1 | 0.008182 | 0 | 515 | 0.891 | 11.59 | 0.906 | 0.877 | – | – | – | – | –
2 | i1 = 1361572 | 515 | 4 | 0.698 | 0.613 | 0.735 | 0.661 | −0.193 | 3.222 | 60.88 | 2 | 1
3 | i2 = 13748132 | 519 | 620 | 0.886 | 10.57 | 0.898 | 0.874 | 0.188 | 57.51 | 3.337 | 1 | 2
4 | i3 = 29908634 | 1139 | 43 | 1.051 | 6.9155 | 1.089 | 1.013 | 0.165 | 2.580 | 3.944 | 4 | 3
5 | i4 = 31197989 | 1182 | 378 | 0.883 | 10.52 | 0.899 | 0.867 | −0.168 | 4.088 | 2.686 | 3 | 4
6 | i5 = 43670258 | 1560 | 31 | 1.054 | 7.555 | 1.101 | 1.007 | 0.171 | 2.775 | 3.865 | 4 | 3
7 | i6 = 44503417 | 1591 | 315 | 0.869 | 11.65 | 0.887 | 0.851 | −0.185 | 4.524 | 2.935 | 3 | 4
8 | i7 = 55133400 | 1906 | 9 | 0.974 | 15.00 | 1.097 | 0.852 | 0.105 | 0.954 | 0.740 | 5 | 9
9 | i8 = 55421600 | 1915 | 68 | 0.866 | 9.916 | 0.902 | 0.830 | −0.108 | 0.780 | 1.180 | 10 | 4
10 | i9 = 57419734 | 1983 | 7 | 1.070 | 6.376 | 1.160 | 0.979 | 0.204 | 4.178 | 6.499 | 3 | 2
11 | i10 = 57834480 | 1990 | 99 | 0.874 | 9.330 | 0.903 | 0.845 | −0.195 | 5.992 | 4.095 | 2 | 3
12 | i11 = 60761193 | 2089 | 14 | 1.000 | 8.449 | 1.074 | 0.927 | 0.126 | 1.702 | 1.879 | 5 | 4
13 | i12 = 61183941 | 2103 | 219 | 0.877 | 10.31 | 0.898 | 0.856 | −0.123 | 1.800 | 1.475 | 4 | 5
14 | i13 = 68128478 | 2322 | 25 | 1.064 | 4.704 | 1.106 | 1.023 | 0.187 | 3.406 | 7.468 | 3 | 2
15 | i14 = 68866007 | 2347 | 8 | 0.808 | 5.307 | 0.885 | 0.731 | −0.257 | 13.99 | 12.40 | 1 | 1
16 | i15 = 69116139 | 2355 | 20 | 1.051 | 3.205 | 1.089 | 1.013 | 0.243 | 11.13 | 18.46 | 1 | 1
17 | i16 = 69558968 | 2375 | 148 | 0.879 | 11.36 | 0.905 | 0.852 | −0.172 | 9.256 | 2.612 | 2 | 4
18 | i17 = 73766518 | 2523 | 20 | 1.001 | 11.58 | 1.074 | 0.929 | 0.123 | 1.324 | 1.298 | 5 | 5
19 | i18 = 74221771 | 2543 | 40 | 0.819 | 14.48 | 0.876 | 0.762 | −0.183 | 2.878 | 2.304 | 3 | 4
20 | i19 = 75395596 | 2583 | 384 | 0.887 | 9.414 | 0.902 | 0.873 | 0.069 | 0.326 | 0.502 | 46 | 3
21 | i20 = 88410178 | 2967 | 32 | 1.137 | 11.62 | 1.194 | 1.080 | 0.250 | 6.632 | 5.374 | 2 | 2
22 | i21 = 93333622 | 2999 | 436 | 1.068 | 6.468 | 1.080 | 1.057 | −0.069 | 0.409 | 0.735 | 32 | 3
23 | i22 = 109227575 | 3435 | 20 | 0.929 | 6.561 | 0.983 | 0.874 | −0.140 | 3.016 | 2.973 | 3 | 3
24 | i23 = 109611296 | 3455 | 339 | 1.058 | 7.336 | 1.072 | 1.044 | 0.129 | 2.531 | 2.264 | 4 | 4
25 | i24 = 119366969 | 3794 | 65 | 1.244 | 7.799 | 1.276 | 1.211 | 0.186 | 4.713 | 4.433 | 3 | 3
26 | i25 = 121496792 | 3859 | 95 | 0.875 | 10.48 | 0.907 | 0.843 | −0.369 | 17.41 | 12.96 | 1 | 1
27 | i26 = 124208998 | 3954 | 3 | 1.358 | 4.516 | 1.475 | 1.242 | 0.483 | 22.29 | 51.74 | 1 | 1
28 | i27 = 124255502 | 3957 | 138 | 0.896 | 13.68 | 0.926 | 0.866 | −0.462 | 47.33 | 15.62 | 1 | 1
29 | i28 = 129730088 | 4095 | 3 | 1.293 | 10.81 | 1.473 | 1.113 | 0.397 | 11.53 | 14.60 | 1 | 1
30 | i29 = 130540104 | 4098 | 774 | 0.875 | 10.45 | 0.886 | 0.864 | −0.418 | 16.19 | 16.73 | 1 | 1
31 | i30 = 153359817 | 4872 | 21 | 0.786 | 8.931 | 0.847 | 0.724 | −0.089 | 0.764 | 0.895 | 8 | 5
32 | i31 = 153907066 | 4893 | 51 | 0.867 | 9.052 | 0.907 | 0.827 | 0.081 | 0.741 | 0.731 | 6 | 7
33 | i32 = 155689115 | 4944 | 76 | 1.256 | 8.784 | 1.288 | 1.223 | 0.389 | 16.67 | 17.18 | 1 | 1
34 | i33 = 157810927 | 5020 | 92 | 1.038 | 4.855 | 1.060 | 1.016 | −0.218 | 5.400 | 9.771 | 2 | 2
35 | i34 = 161040443 | 5112 | 14 | 1.276 | 8.264 | 1.349 | 1.203 | 0.239 | 11.72 | 6.888 | 2 | 2
36 | i35 = 161358721 | 5126 | 651 | 1.061 | 6.034 | 1.070 | 1.051 | −0.216 | 5.631 | 7.71273 | 2 | 2
37 | i36 = 180598006 | 5777 | 63 | 1.268 | 11.25 | 1.308 | 1.228 | 0.208 | 7.147 | 3.832 | 2 | 3
38 | i37 = 182531303 | 5840 | 122 | 1.064 | 4.356 | 1.081 | 1.046 | −0.205 | 3.766 | 9.625 | 3 | 2
39 | i38 = 188033882 | 5962 | 7 | 1.442 | 26.54 | 1.626 | 1.257 | 0.378 | 32.81 | 5.385 | 1 | 2
40 | i39 = 188185029 | 5969 | 923 | 1.058 | 5.672 | 1.065 | 1.051 | −0.384 | 5.543 | 25.94 | 2 | 1
41 | i40 = 217472857 | 6892 | 22 | 1.321 | 31.500 | 1.435 | 1.208 | 0.262 | 12.21 | 2.199 | 2 | 4
42 | i41 = 218088420 | 6914 | 796 | 1.049 | 6.9353 | 1.058 | 1.040 | −0.272 | 2.357 | 11.71 | 4 | 2
Fig. 4. Measurements and estimates of the 2nd chromosome taken from the archive “159A–vs–159D–cut” of ROMA.
4.2. Detected data with large segmental errors
Another database processed corresponds to the 2nd chromosome in “159A–vs–159D–cut” of ROMA, supposedly having 42 segments and 41 breakpoints. This case (Fig. 4) demonstrates large segmental errors when a segment is represented by only a few probes. In turn, only several breakpoints are accompanied here by large jitter. Some statistics of this chromosome are given in Table 4.
Four specific regions of this chromosome associated with large segmental errors are sketched in Figs. 5 and 6. Even a quick look suggests that in almost all of the cases of short chromosomal changes the segmental errors reach tens of percent. Moreover, some segments cannot be estimated at all with a reasonable error. Thus, there is a probability that such changes do not exist. In fact, errors in the estimates of $a_4$, $a_6$, $a_{14}$, $a_{16}$, $a_{25}$, $a_{33}$, and $a_{37}$ exceed 30%. The situation is even worse with $a_{10}$, $a_{12}$, $a_{15}$, $a_{18}$, $a_{35}$, and $a_{39}$, where the estimation errors reach 40–50%, so such estimates are not very useful. Furthermore, there are two segments, $a_2$ and $a_8$, whose levels cannot be estimated correctly, and the question arises of their existence.
Another specific feature of this chromosome is that the detected part around the breakpoint $i_{19}$ does not contain enough information for experts. As a consequence, neither $a_{19}$ nor $i_{19}$ can be estimated with a reasonably small error. The situation is somewhat better with $a_{31}$, $i_{30}$, and $i_{31}$, but there is not enough accuracy here either, and the predicted change associated with $a_{31}$ may not exist.
Fig. 5. UB and LB masks for the estimates of the CNVs shown in Fig. 4: (a) genomic location from 10 Mb to 45 Mb and (b) genomic location from 50 Mb to 80 Mb. Errors in the estimates of $a_1$, $a_6$, $a_{14}$, and $a_{16}$ exceed 30%. Errors in the estimates of $a_{10}$, $a_{12}$, $a_{15}$, and $a_{18}$ reach 40–50%. There is a probability that changes $a_2$ and $a_8$ do not exist.
Fig. 6. UB and LB masks for the estimates of the CNVs shown in Fig. 4: (a) genomic location from 119 Mb to 126 Mb and (b) genomic location from 150 Mb to 190 Mb. Errors in the estimates of $a_{25}$, $a_{33}$, and $a_{37}$ exceed 30%. Errors in the estimates of $a_{35}$ and $a_{39}$ reach 40–50%.
Fig. 7. Expansions of the UB and LB masks by increasing the probability of existence of genomic changes around two detected breakpoints. The estimated CNVs are dashed in bold and detected data are dotted.
5. Discussion
5.1. Segmental errors
Fig. 2 suggests that an average segmental variance is $\sigma^2 = 53.24$. For white segmental noise, it can be reduced by averaging as $\sigma^2/N_l$ in order to reach the standard deviation of 0.243 (which we consider acceptable) over $N_l = 900$ probes. It turns out, however, that only $\hat a_1$ is observed with $N_1 = 602$ probes and all other segments have much smaller $N_l$. Hence, most segments will suffer from large errors and a higher probe resolution will be required. It was shown in [4] that the CNVs may be about 1 kb in size. In order to obtain small segmental errors, the probes must thus have a resolution of at least 200 bp, as reported in [7], or even higher, which is available using NGS technologies. For the detection resolution of 5–10 kb reported in [6] and 30 kb in [21], an estimator will not be able to provide a realistic CNVs picture, because some short segments will be lost.
The CNVs estimation error can be supposed to be negligible if it does not exceed a threshold of ±0.01, or 1% of an average change of 1.0 in log2 Ratio. For the average variance $\sigma^2 = 53.24$ in Fig. 2, the UB and LB suggest that the estimate will stay below this threshold if $N_l = 4791$. However, the maximal segmental length in Fig. 4 is $N_l = 923$ (see $a_{40}$), which corresponds to a threshold of about ±0.023. Other segments require larger and even much larger thresholds. So, there is no genomic change in Fig. 2 which can be estimated with 1% error. In the best case, the estimation error exceeds 2.3%.
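The segment lengths quoted above follow from inverting the bound half-width ε = ϑ√(σ²/N), which gives N = ϑ²σ²/ε². The short Python sketch below reproduces the two figures. It assumes the 3-sigma sense (ϑ = 3) and reads the variance 53.24 as 53.24 × 10⁻³; both are assumptions of this sketch, chosen because they appear to match the quoted numbers, and the function names are hypothetical.

```python
import math


def required_probes(noise_variance, threshold, theta):
    """Segment length N at which theta * sqrt(variance / N) equals threshold."""
    return theta ** 2 * noise_variance / threshold ** 2


def error_threshold(noise_variance, num_probes, theta):
    """Half-width theta * sqrt(variance / N) reached with num_probes probes."""
    return theta * math.sqrt(noise_variance / num_probes)


AVERAGE_VARIANCE = 53.24e-3  # assumed scale of the quoted value 53.24
probes_for_one_percent = required_probes(AVERAGE_VARIANCE, 0.01, 3.0)
threshold_for_923 = error_threshold(AVERAGE_VARIANCE, 923, 3.0)
```

Under these assumptions the first value is about 4791.6 probes and the second is about 0.023, in line with the numbers given in the text.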
5.2. Jitter in the breakpoints
It follows from Tables 3 and 4 that a minimal jitter of ±1 points is achieved with $\gamma > 10$, or 10 dB. However, most frequently we meet $\gamma \cong 1$, or 0 dB, which causes a jitter error of several points. Sometimes, $\gamma$ occurs to be small only in one of the segments ($i_{39}$ and $i_{40}$ in Table 4). This does not lead to a substantial difference between the left and right jitters. But if the SNRs fall essentially below unity in both segments, jitter may reach tens of points and nothing definitive can be said about the breakpoint location. This is the case of $i_5$ in Table 3 and Fig. 3a. So, an acceptable jitter of ±1 points is not common for the HR-CGH array-based probes. Typically, jitter reaches several points here.
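For reference, the segmental SNRs of (9) and their decibel values can be computed as in the following Python sketch. The function names are hypothetical. The sample inputs are the first breakpoint of Table 3, with the variance entries 6.302 and 25.14 read in units of 10⁻³; that scale is an assumption of this sketch.

```python
import math


def segmental_snrs(delta, variance_left, variance_right):
    """Return (gamma_minus, gamma_plus) = (delta^2/var_l, delta^2/var_{l+1})."""
    return delta ** 2 / variance_left, delta ** 2 / variance_right


def to_decibels(snr):
    """Convert a power-type SNR to decibels."""
    return 10.0 * math.log10(snr)


# Breakpoint i1 of Table 3 (variance scale of 1e-3 is an assumption).
gamma_minus, gamma_plus = segmental_snrs(0.393, 6.302e-3, 25.14e-3)
gamma_minus_db = to_decibels(gamma_minus)
```

With these inputs the left SNR is well above 10 dB and the right one is below it, which agrees with the small but unequal left and right jitter listed for this breakpoint.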
5.3. Basic properties of confidence UB and LB masks
Based on the above analysis, we conclude that the confidence UB and LB masks can serve as a powerful tool for medical applications owing to the following basic properties:
• The true CNVs exist between $B_n^U$ and $B_n^L$ with the required probability $P(N_l)$.
• If either $B_n^U$ or $B_n^L$ covering two or more breakpoints occurs to be uniform, then there is a probability of no changes in this region.
• If both $B_n^U$ and $B_n^L$ covering two or more breakpoints occur to be uniform, then there is a high probability of no changes in this region.
We finally examine the effect of the confidence interval on the UB and LB masks. To illustrate, we consider a part of detected CNVs (bold dots in Fig. 7) which consists of two breakpoints and is highly contaminated by noise. We then compute the masks using the algorithm (Table 2) for several of the most common probabilities listed in Table 1. Fig. 7 shows that the 1-sigma sense (68.27%) is closely related to probable existence (75%). With such probabilities, jitter in the breakpoints does not exist and segmental changes are estimated at an acceptable error level of about 0.1. If the probability is increased, passing through the almost certain existence and the 2-sigma sense to the 3-sigma sense (99.73%), then the UB and LB stretch so that the genomic changes cannot be said to exist with acceptable errors. This example neatly demonstrates that a final deduction about the CNVs structure will strongly depend on the confidence interval. Further investigations are thus required in order to specify this interval via the probability P which is most acceptable for medical conclusions.
6. Conclusions
Modern technologies such as NGS allow detecting genome changes with a probe resolution of about 0.8–6 kb in white Gaussian noise having segmental SNRs around unity. Under such conditions, estimates of segmental changes and breakpoint locations are often accompanied by large errors, especially if genome changes are about 1 kb in length. In view of large noise, no estimator, even an ideal one, is able to provide jitter-free detection of the breakpoint locations and error-free estimation of segmental changes.
The confidence UB and LB masks proposed in this paper outline a region within which the CNVs estimates exist with a given probability. The masks can thus serve as an auxiliary tool for medical experts to make decisions about CNVs structures. The algorithm designed is able to produce the masks for any confidence interval. Testing some HR-CGH array data in the 3-sigma sense (confidence interval of 99.73%) has revealed errors of 30–50% and larger in many segments. Jitter was also shown to be large in some breakpoints. Moreover, it was indicated that there is a probability that some changes predicted by ROMA may not exist. By reducing the confidence interval, the boundaries inherently squeeze around the predicted changes. However, the masks lose their usefulness when the confidence interval reaches 50%.
Based upon this investigation, we conclude that further work should be focused on optimizing the confidence interval in order to meet medical needs. Also, the algorithm should be combined with an efficient detector of the breakpoints. Special attention should be paid to the mathematical justification of the probability that the detected breakpoint exists in the confidence masks. Finally, a more accurate approximation of the jitter distribution needs to be found, since the discrete skew Laplace law does not fit the process well at small SNRs. We are working on this now and expect to present some results in the near future.
eferences
[1] A.F. Cockburn, M.J. Newkirk, R.A. Firtel, Organization of the RNA genes of Dictyostelium: mapping of the nontranscribed spacer regions, Cell 9 (4) (1976) 605–613, Part 1.
[2] F.S. Collins, E.S. Lander, J. Rogers, R.H. Waterson, Finishing the euchromatic sequence of the human genome, Nature 431 (7011) (2004) 931–945.
[3] R. Redon, S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, M.E. Hurles, Global variation in copy number in the human genome, Nature 444 (2006) 444–454.
[4] P. Stankiewicz, J.R. Lupski, Structural variation in the human genome and its role in disease, Ann. Rev. Med. 61 (2010) 437–455.
[5] S. Yoon, Z. Xuan, V. Makarov, K. Ye, J. Sebat, Sensitive and accurate detection of copy number variants using read depth of coverage, Genome Res. 19 (2009) 1586–1592.
[6] H. Ren, W. Francis, A. Boys, A.C. Chueh, N. Wong, P. La, L.H. Wong, J. Ryan, H.R. Slater, K.H.A. Choo, BAC-based PCR fragment microarray: high-resolution detection of chromosomal deletion and duplication breakpoints, Hum. Mutat. 25 (5) (2005) 476–482.
[7] A.E. Urban, J.O. Korbel, R. Selzer, T. Richmond, A. Hacker, G.V. Popescu, J.F. Cubells, R. Green, B.S. Emanuel, M.B. Gerstein, S.M. Weissman, M. Snyder, High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays, Proc. Natl. Acad. Sci. (PNAS) 103 (12) (2006) 4534–4539.
[8] J.O. Korbel, A.E. Urban, J.P. Affourtit, B. Godwin, F. Grubert, J.F. Simons, P.M. Kim, D. Palejev, N.J. Carriero, L. Du, B.E. Taillon, Z. Chen, A. Tanzer, A.C.E. Saunders, J. Chi, F. Yang, N.P. Carter, M.E. Hurles, S.M. Weissman, T.T. Harkins, M.B. Gerstein, M. Egholm, M. Snyder, Paired-end mapping reveals extensive structural variation in the human genome, Science 318 (5849) (2007) 420–426.
[9] C. Xie, M.T. Tammi, CNV-seq, a new method to detect copy number variation using high-throughput sequencing, BMC Bioinform. 10 (80) (2009) 1–9.
[10] S. Ivakhno, T. Royce, A.J. Cox, D.J. Evers, R.K. Cheetham, S. Tavaré, CNAseg – a novel framework for identification of copy number changes in cancer from second-generation sequencing data, Bioinformatics 26 (24) (2010) 3051–3058.
[11] V. Boeva, A. Zinovyev, K. Bleakley, J.P. Vert, I. Janoueix-Lerosey, O. Delattre, E. Barillot, Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization, Bioinformatics 27 (2) (2011) 268–269.
[12] A. Gusnanto, H.M. Wood, Y. Pawitan, P. Rabbitts, S. Berri, Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data, Bioinformatics 28 (2012) 40–47.
[13] J. Duan, J.G. Zhang, H.W. Deng, Y.P. Wang, Comparative studies of copy number variation detection methods for next-generation sequencing technologies, PLoS ONE 8 (3) (2013) 1–12, e59128.
[14] Y.H. Yang, S. Dudoit, P. Luu, D.M. Lin, V.P. Peng, J. Ngai, T.P. Speed, Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Res. 30 (4) (2002) 1–10.
[15] C. Zong, S. Lu, A.R. Chapman, X.S. Xie, Genome-wide detection of single-nucleotide and copy-number variations of a single human cell, Science 338 (2012) 1622–1626.
[16] J. Munoz-Minjares, J. Cabal-Aragón, Y.S. Shmaliy, Effect of noise on estimates of stepwise changes in genome DNA chromosomal systems, WSEAS Trans. Biol. Biomed. 11 (2014) 52–61.
[17] W.R. Lai, M.D. Johnson, R. Kucherlapati, P.J. Park, Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data, Bioinformatics 21 (19) (2005) 3763–3770.
[18] J. Munoz-Minjares, Y.S. Shmaliy, J. Cabal-Aragón, Confidence limits for genome DNA copy number variations in HR-CGH array measurements, Biomed. Signal Process. Control 10 (2014) 166–173.
[19] F. Picard, S. Robin, M. Lavielle, C. Vaisse, J.J. Daudin, A statistical approach for array CGH data analysis, BMC Bioinform. 6 (27) (2005) 1–14.
[20] J. Munoz-Minjares, J. Cabal-Aragón, Y.S. Shmaliy, Jitter probability in the breakpoints of discrete sparse piecewise-constant signals, in: Proc. 21st European Signal Process. Conf. (EUSIPCO-2013), 2013, pp. 1–5.
[21] R. Lucito, J. Healy, J. Alexander, A. Reiner, D. Esposito, M. Chi, L. Rodgers, A. Brady, J. Sebat, J. Troge, J.A. West, S. Rostan, K.C.Q. Nguyen, S. Powers, K.Q. Ye, A. Olshen, E. Venkatraman, L. Norton, M. Wigler, Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation, Genome Res. 10 (2003) 2291–2305.
[22] J. Munoz-Minjares, J. Cabal-Aragón, Y.S. Shmaliy, Probabilistic bounds for estimates of genome DNA copy number variations using HR-CGH microarrays, in: Proc. 21st European Signal Process. Conf. (EUSIPCO-2013), 2013, pp. 1–5.
[23] J. Wu, K.R. Grzeda, C. Stewart, F. Grubert, A.E. Urban, M.P. Snyder, G.T. Marth, Copy number variation detection from 1000 genomes project exon capture sequencing data, BMC Bioinform. 13 (305) (2012) 1–19.
[24] J.H. Chu, A. Rogers, I. Ionita-Laza, K. Darvishi, R.E. Mills, C. Lee, B.A. Raby, Copy number variation genotyping using family information, BMC Bioinform. 14 (157) (2013) 1–11.
[25] T.J. Kozubowski, S. Inusah, A skew Laplace distribution on integers, Ann. Inst. Stat. Math. 58 (2006) 555–571.