Biomedical Signal Processing and Control 13 (2014) 337–344
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/bspc
http://dx.doi.org/10.1016/j.bspc.2014.06.006
1746-8094/© 2014 Elsevier Ltd. All rights reserved.

Technical Note

Confidence masks for genome DNA copy number variations in applications to HR-CGH array measurements

Jorge Muñoz-Minjares, Jesús Cabal-Aragón, Yuriy S. Shmaliy*
Universidad de Guanajuato, Department of Electronics Engineering, Salamanca, 36885 Gto., Mexico
* Corresponding author. Tel.: +52 464 647 01 95; fax: +52 464 647 24 00. E-mail addresses: [email protected], [email protected] (Y.S. Shmaliy).

Article history: Received 21 November 2013. Received in revised form 21 May 2014. Accepted 19 June 2014.

Keywords: Genome copy number variations; Confidence limit masks; HR-CGH microarray

Abstract

The array-comparative genomic hybridization (aCGH) and next generation sequence technologies enable cost-efficient high-resolution detection of DNA copy number variations (CNVs). However, while the CNV estimates provided by different methods are often inconsistent with each other, little can be found about the estimation errors. Based on our recent studies of the confidence limits for stepwise signals measured in noise, we develop an efficient algorithm for computing the confidence upper and lower boundary masks in order to guarantee the existence of genomic changes with a required probability. We suggest combining these masks with estimates in order to give medical experts more information about the true CNV structures. Applications to high-resolution CGH microarray measurements show that there is a probability that some changes predicted by an estimator may not exist.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The deoxyribonucleic acid (DNA) of a genome is commonly recognized to be essential for human life. The DNA is usually double-stranded, therefore the size of a gene or chromosome is often measured in base pairs (bp). A unit of measurement is the kilobase (kb), equal to 1000 bp of DNA [1]. The DNA can demonstrate structural changes called copy-number variations (CNVs) associated with disease such as cancer [2]. A copy number variation (CNV) can be defined as a DNA segment of one kb or larger that is present at a variable copy number in comparison with a reference genome [3]. The human genome with 23 chromosomes is estimated to be about 3.2 billion base pairs long and to contain 20,000–25,000 distinct genes [2]. It is known [4] that each CNV may range from about 1 kb to several megabases (Mb) in size.

To detect the CNVs at a resolution level of 10–25 kb [5], the array-comparative genomic hybridization (aCGH) technique was developed employing chromosomal microarray analysis. Although it was reported that the high-resolution CGH (HR-CGH) arrays are accurate enough to detect structural variations at a resolution of 5 kb [6] and even 200 bp [7], their regular resolution is still insufficient to detect short genomic changes. Progress was achieved in the last few years following the emergence of the next generation sequence (NGS) technologies, which allow for the detection of CNVs with resolution < 10 kbp [8]. The NGS approach has generated extensive developments of the CNV detection methods [9–12], and several such methods obtaining a resolution of 0.8–6 kb were recently reviewed in [13].

Even though the conceptual steps in the aCGH and CNV-seq methods are different, the outputs are typically represented in the same scales [9]. The genomic location is often given with the $n$th probes, $n \in [1, M]$, where $M$ is the number of probes following with a unit step, ignoring "bad" or empty probes. The $n_l$th discrete point corresponds to the $i_l$th edge (breakpoint) in the genomic location scale in kb or Mb, which is finally used to represent the CNVs. The breakpoints are placed as $0 < n_1 < \cdots < n_L < M$, where $n_l$, $l \in [1, L]$, is the $l$th breakpoint and $L$ is the number of the breakpoints. The CNVs are represented with $L + 1$ segmental constant changes $a_j$, $j \in [1, L + 1]$, each characterizing a segment between $i_{j-1}$ and $i_j$ on the interval $[i_{j-1}, i_j - 1]$. Typically, the CNVs are normalized as $\log_2 R/G = \log_2$ Ratio, where $R$ and $G$ are the fluorescent Red and Green intensities, respectively [14].

Genome CNVs are stepwise sparse with a limited number of breakpoints [15]. The detected structure is usually contaminated by intensive noise [16]. In the $\log_2$ Ratio scale, noise is commonly modeled as white Gaussian with equal or different segmental variances. Various statistical approaches have been developed in order to estimate CNVs in array CGH and NGS data. Two main goals are [17]: (1) to infer the number and statistical significance of the alterations and (2) to locate their boundaries accurately. A number of statistical methods were tested by CNVs measurements as observed in [18], including wavelet-based, robust, adaptive kernel



Fig. 1. Genomic changes with a single breakpoint at $n_l$: (a) $\log_2$ Ratio, (b) segmental Gaussian distributions with different variances, and (c) skew Laplace jitter distribution in the breakpoint at $n_l$.

Table 1
Probability measures for genomic changes.

| Sense | $\vartheta$ | $P$ (%) | $\varkappa$ (%) |
|---|---|---|---|
| Even chances | 0.6745 | 50 | 50 |
| 1-Sigma | 1 | 68.27 | 31.73 |
| Probable | 1.15035 | 75 | 25 |
| Almost certain | 1.81191 | 93 | 7 |
| Typical confidence | 1.96 | 95 | 5 |
| 2-Sigma | 2 | 95.45 | 4.55 |
| 3-Sigma | 3 | 99.73 | 0.27 |
| Certain | ∞ | 100 | 0 |

smoothers, maximum likelihood (ML), penalized bridge estimator and ridge regression, fused least-absolute shrinkage and selection operator (Lasso), the Schwarz information criterion-based estimator, and forward-backward smoothers. Quite comprehensive reviews of the most common algorithms were given in [17,13].

It has to be remarked that, in view of large detection noise, no estimator, even an ideal one, is able to provide a clear picture of genomic changes [17]: the estimates are often accompanied by large segmental errors and jitter in the breakpoints. Thus, medical experts may have insufficient information for a correct decision about the true CNV structure [15]. Even so, little can be found in the literature about the estimation errors, despite the fact that the existence of large jitter in the CNV breakpoints has been shown experimentally in [19]. The problem is complicated by the fact that the exact jitter distribution is still unknown for such signals even in white Gaussian noise. Just recently, in [18,20], we derived an approximate jitter distribution and showed that it obeys the discrete skew Laplace law.

In this paper, we introduce a statistical framework and develop an efficient algorithm for computing the confidence lower boundary (LB) and upper boundary (UB) masks for CNVs. The proposed masks can be applied to measurements conducted by any technology, although we give applications only to high-resolution CGH (HR-CGH) microarray data available from [21]. The rest of the paper is organized as follows. In Section 2, we consider a statistical model of genomic changes. A computational algorithm for the confidence UB and LB masks is developed in Section 3. Testing of some HR-CGH microarray-based CNV measurements by the confidence masks is given in Section 4. Discussions of segmental errors and jitter are provided in Section 5. Finally, concluding remarks can be found in Section 6.

2. Confidence masks for genomic changes

Estimation error bounds for genome DNA CNVs can be found by statistically specifying the noise in the segments and the jitter in the breakpoints [22]. Based upon the statistical properties of the segmental noise and the jitter in the breakpoints, the confidence UB and LB masks can be formalized for CNVs to exist with some probability. The masks, in turn, can serve as additional measures for experts to make decisions about possible genomic changes.

2.1. Statistical modeling of detected genomic changes

A typical genomic change detected around the $l$th breakpoint at $n_l$ is illustrated in Fig. 1a. One recognizes here two segments at levels $a_l$ and $a_{l+1}$. In the $\log_2$ Ratio scale, segmental noise is often supposed to be white [23] and modeled with the Gaussian density [24]

$$p_l(x) = \frac{1}{\sqrt{2\pi\sigma_l^2}} \exp\left[-\frac{(x - a_l)^2}{2\sigma_l^2}\right], \tag{1}$$

in which the segmental noise variances $\sigma_l^2$ along the genome can be different [15], as shown in Fig. 1b. In the presence of segmental noise, the breakpoint location cannot be predicted with unit probability. An estimator may find it several points to the left or to the right of the true location; this deviation is the jitter. We have shown in [18] that the jitter distribution measured in [19] can approximately be described with the discrete skew Laplace law [25] (Fig. 1c)

$$p(k \mid d_l, q_l) = \frac{(1 - d_l)(1 - q_l)}{1 - d_l q_l} \begin{cases} d_l^{k}, & k \geq 0, \\ q_l^{|k|}, & k \leq 0, \end{cases} \tag{2}$$

where $d_l = e^{-\kappa_l/\nu_l} \in (0, 1)$, $q_l = e^{-1/(\kappa_l \nu_l)} \in (0, 1)$, $\kappa_l = \sqrt{\ln x_l / \ln(x_l/\mu_l)}$, and $\nu_l = -\kappa_l / \ln x_l > 0$. Auxiliary functions connected with the probabilities for each detected point to belong to one segment or another are specified as

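$$x_l = \frac{\phi_l (1 + \mu_l)}{2(1 + \phi_l)} \left(1 - \sqrt{1 + \frac{4\mu_l (1 - \phi_l^2)}{\phi_l^2 (1 + \mu_l)^2}}\right), \tag{3}$$

$$\mu_l = \frac{P_l^A (1 - P_l^B)}{P_l^B (1 - P_l^A)}, \tag{4}$$

$$\phi_l = \frac{P_l^A + P_l^B - 1}{(1 - 2P_l^A)(1 - 2P_l^B)}, \tag{5}$$

$$P_l^A = \begin{cases} 1 + \frac{1}{2}\left[\operatorname{erf}(g_l^\beta) - \operatorname{erf}(g_l^\alpha)\right], & \gamma_l^- < \gamma_l^+, \\ \frac{1}{2}\operatorname{erfc}(g_l^\alpha), & \gamma_l^- = \gamma_l^+, \\ \frac{1}{2}\left[\operatorname{erf}(g_l^\beta) - \operatorname{erf}(g_l^\alpha)\right], & \gamma_l^- > \gamma_l^+, \end{cases} \tag{6}$$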


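Table 2
Algorithm for computing the confidence UB mask $B_n^U$ and LB mask $B_n^L$ via detected CNVs $y_n$ and given breakpoint locations $\hat{n}_l$ for the bound wideness $\vartheta$.

Input: $y_n$, $\hat{n}_l$, $\vartheta$, $L$, $M$
1: $N_{L+1} = M - \hat{n}_L$, $\hat{n}_0 = 0$, $\varkappa = \operatorname{erfc}(\vartheta/\sqrt{2})$
2: for $j = 1 : L + 1$ do
3:   $N_j = \hat{n}_j - \hat{n}_{j-1}$, $\hat{a}_j = \frac{1}{N_j} \sum_{v=\hat{n}_{j-1}}^{\hat{n}_j - 1} y_v$
4:   $\hat{\sigma}_j = \sqrt{\frac{1}{N_j} \sum_{v=\hat{n}_{j-1}}^{\hat{n}_j - 1} (y_v - \hat{a}_j)^2}$
5: end for
6: for $l = 1 : L$ do
7:   $\Delta_l = \hat{a}_{l+1} - \hat{a}_l$, $\gamma_l^- = \Delta_l^2/\hat{\sigma}_l^2$, $\gamma_l^+ = \Delta_l^2/\hat{\sigma}_{l+1}^2$
8:   $\alpha_l$ by (8) with "$-$" and $a_l = \hat{a}_l$
9:   $\beta_l$ by (8) with "$+$" and $a_l = \hat{a}_l$
10:  $g_l^\beta = \frac{\beta_l - \hat{a}_l}{|\Delta_l|} \sqrt{\frac{\gamma_l^-}{2}}$, $g_l^\alpha = \frac{\alpha_l - \hat{a}_l}{|\Delta_l|} \sqrt{\frac{\gamma_l^-}{2}}$
11:  $h_l^\beta = \frac{\beta_l - \hat{a}_l}{|\Delta_l|} \sqrt{\frac{\gamma_l^+}{2}}$, $h_l^\alpha = \frac{\alpha_l - \hat{a}_l}{|\Delta_l|} \sqrt{\frac{\gamma_l^+}{2}}$
12:  $P_l^A$ by (6), $P_l^B$ by (7), $\phi_l$ by (5)
13:  $\mu_l = \frac{P_l^A (1 - P_l^B)}{P_l^B (1 - P_l^A)}$, $x_l$ by (3), $\kappa_l = \sqrt{\frac{\ln x_l}{\ln(x_l/\mu_l)}}$
14:  $\nu_l = -\frac{\kappa_l}{\ln x_l}$, $d_l = e^{-\kappa_l/\nu_l}$, $q_l = e^{-1/(\kappa_l \nu_l)}$
15:  $k_l^R$ by (14)  ▷ right jitter
16:  $k_l^L$ by (15)  ▷ left jitter
17: end for
18: $C_{L+1} = M - 1$, $D_{L+1} = M - 1$
19: for $l = 1 : L$ do
20:  $C_l = \hat{n}_l - k_l^R$ if $\Delta_l > 0$; $C_l = \hat{n}_l + k_l^L$ if $\Delta_l < 0$
21:  $D_l = \hat{n}_l + k_l^L$ if $\Delta_l > 0$; $D_l = \hat{n}_l - k_l^R$ if $\Delta_l < 0$
22: end for
23: for $l = 1 : L$ do
24:  $C_l = C_l$ if $\operatorname{Im} C_l = 0$; $C_l = C_{l-1}$ if $\Delta_l \geq 0$ and $\operatorname{Im} C_l \neq 0$; $C_l = C_{l+1}$ if $\Delta_l < 0$ and $\operatorname{Im} C_l \neq 0$
25:  $D_l = D_l$ if $\operatorname{Im} D_l = 0$; $D_l = D_{l+1}$ if $\Delta_l \geq 0$ and $\operatorname{Im} D_l \neq 0$; $D_l = D_{l-1}$ if $\Delta_l < 0$ and $\operatorname{Im} D_l \neq 0$
26: end for
27: $l = 1$, $k = 1$
28: for $n = 0 : M - 1$ do
29:  $l = l$ if $n < C_l$; $l = l + 1$ if $n \geq C_l$ and $C_{l+1} > C_l$; $l = l + 2$ if $n \geq C_l$ and $C_{l+1} \leq C_l$
30:  $k = k$ if $n < D_l$; $k = k + 1$ if $n \geq D_l$ and $D_{l+1} > C_l$; $k = k + 2$ if $n \geq D_l$ and $D_{l+1} \leq C_l$
31:  $B_n^U = \hat{a}_l + \vartheta \sqrt{\hat{\sigma}_l^2 / N_l}$  ▷ UB mask
32:  $B_n^L = \hat{a}_k - \vartheta \sqrt{\hat{\sigma}_k^2 / N_k}$  ▷ LB mask
33: end for
Output: $B_n^U$, $B_n^L$

Fig. 2. Measurements and estimates of a part of the 7th chromosome taken from the archive "159A–vs–159D–cut" of ROMA.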

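$$P_l^B = \begin{cases} \frac{1}{2}\left[\operatorname{erf}(h_l^\alpha) - \operatorname{erf}(h_l^\beta)\right], & \gamma_l^- < \gamma_l^+, \\ 1 - \frac{1}{2}\operatorname{erfc}(h_l^\alpha), & \gamma_l^- = \gamma_l^+, \\ 1 + \frac{1}{2}\left[\operatorname{erf}(h_l^\alpha) - \operatorname{erf}(h_l^\beta)\right], & \gamma_l^- > \gamma_l^+, \end{cases} \tag{7}$$

where $g_l^\beta = \frac{\beta_l - \Delta_l}{|\Delta_l|}\sqrt{\gamma_l^-/2}$, $g_l^\alpha = \frac{\alpha_l - \Delta_l}{|\Delta_l|}\sqrt{\gamma_l^-/2}$, $h_l^\beta = \frac{\beta_l}{|\Delta_l|}\sqrt{\gamma_l^+/2}$, $h_l^\alpha = \frac{\alpha_l}{|\Delta_l|}\sqrt{\gamma_l^+/2}$, $\operatorname{erf}(x)$ is the error function, $\operatorname{erfc}(x)$ is the complementary error function, and

$$\alpha_l, \beta_l = \frac{a_l \gamma_l^- - a_{l+1} \gamma_l^+}{\gamma_l^- - \gamma_l^+} \mp \frac{1}{\gamma_l^- - \gamma_l^+} \sqrt{(a_l - a_{l+1})^2 \gamma_l^- \gamma_l^+ + 2\Delta_l^2 (\gamma_l^- - \gamma_l^+) \ln \sqrt{\frac{\gamma_l^-}{\gamma_l^+}}} \tag{8}$$

if $\gamma_l^- \neq \gamma_l^+$. For $\gamma_l^- = \gamma_l^+$, set $\alpha_l = \Delta_l/2$ and $\beta_l = \pm\infty$.

The signal-to-noise ratios (SNRs) in the $l$th and $(l + 1)$th segments are specified as, respectively,

$$\gamma_l^- = \frac{\Delta_l^2}{\sigma_l^2}, \qquad \gamma_l^+ = \frac{\Delta_l^2}{\sigma_{l+1}^2}, \tag{9}$$

where the segmental change is $\Delta_l = a_{l+1} - a_l$.

Distributions (1) and (2) suggest that, due to segmental noise (Fig. 1a), there is always a probability (often high) of segmental errors (often large) and jitter in the breakpoints (sometimes essential), irrespective of the estimator used. Of prime interest is thus how large these errors are and which confidence limit masks guarantee the existence of genomic changes with sufficient probability.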
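The jitter law (2) can be checked numerically. The following Python sketch evaluates the discrete skew Laplace probability for given $d_l$ and $q_l$ and maps $(\kappa_l, \nu_l)$ to $(d_l, q_l)$ as defined after (2). The function names `skew_laplace_pmf` and `dq_from_kappa_nu` are illustrative assumptions and do not come from the paper.

```python
import math


def skew_laplace_pmf(k: int, d: float, q: float) -> float:
    """Discrete skew Laplace probability of a jitter of k points, as in Eq. (2).

    d governs the decay for k >= 0 and q the decay for k < 0; both lie in (0, 1).
    """
    if not (0.0 < d < 1.0 and 0.0 < q < 1.0):
        raise ValueError("d and q must lie strictly between 0 and 1")
    norm = (1.0 - d) * (1.0 - q) / (1.0 - d * q)
    return norm * (d ** k if k >= 0 else q ** (-k))


def dq_from_kappa_nu(kappa: float, nu: float) -> tuple[float, float]:
    """Map the skew (kappa) and scale (nu) parameters to d and q."""
    return math.exp(-kappa / nu), math.exp(-1.0 / (kappa * nu))


if __name__ == "__main__":
    d, q = 0.5, 0.2  # arbitrary illustrative values, not measured data
    total = sum(skew_laplace_pmf(k, d, q) for k in range(-200, 201))
    print(f"sum of probabilities = {total:.12f}")
```

With $d_l = q_l$ the law is symmetric, and unequal values skew the jitter toward the side with the larger parameter.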

2.2. Confidence UB and LB masks

Let us suppose that the estimate $\hat{n}_l$ of the $l$th breakpoint location is available. At least, it can be assigned visually. In view of the white nature of the noise, simple averaging applied on an interval of $N_l = \hat{n}_l - \hat{n}_{l-1}$ points from $\hat{n}_{l-1}$ to $\hat{n}_l - 1$ gives the best estimate for the $l$th segmental level, $\hat{a}_l = \frac{1}{N_l} \sum_{v=\hat{n}_{l-1}}^{\hat{n}_l - 1} y_v$, whose mean value is $E\{\hat{a}_l\} = a_l$ and whose variance is $\hat{\sigma}_l^2 = \sigma_l^2 / N_l$. Because $\hat{\sigma}_l^2$ is commonly not negligible, segmental errors occur. The confidence UB and LB for segmental estimates can thus be specified in the $\vartheta$-sigma sense as

$$\hat{a}_l^{UB} \cong \hat{a}_l + \varepsilon = \hat{a}_l + \vartheta \sqrt{\frac{\sigma_l^2}{N_l}} = \hat{a}_l + \vartheta \hat{\sigma}_l, \tag{10}$$


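Fig. 3. UB mask and LB mask for the estimates of the CNVs shown in Fig. 2: (a) genomic location from 130 Mb to 146 Mb and (b) genomic location from 146 Mb to 156 Mb. Jitter in $i_1$, $i_6$, $i_7$, $i_9$, $i_{10}$, $i_{12}$, and $i_{13}$ is moderate and these breakpoints are well detectable. The breakpoints $i_2$, $i_3$, $i_4$, $i_5$, $i_8$, $i_9$, and $i_{11}$ cannot be estimated correctly owing to large jitter. There is a probability that the breakpoints $i_2$, $i_3$, $i_4$, $i_5$, and $i_{11}$ do not exist. There is a high probability that the breakpoint $i_5$ does not exist.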

$$\hat{a}_l^{LB} \cong \hat{a}_l - \varepsilon = \hat{a}_l - \vartheta \sqrt{\frac{\sigma_l^2}{N_l}} = \hat{a}_l - \vartheta \hat{\sigma}_l, \tag{11}$$

where $\vartheta$ indicates the bound wideness in terms of $\hat{\sigma}_l$. The probability $\varkappa$ for the segmental estimate to exceed a threshold $\varepsilon$ strongly depends on the segmental length $N_l$ and can be determined, using (1) and $\hat{a}_l$, as $\varkappa(N_l) = 2\int_{a_l + \varepsilon}^{\infty} p_l(x)\,\mathrm{d}x = \operatorname{erfc}\left(\frac{\varepsilon}{\sqrt{\sigma_l^2}} \sqrt{\frac{N_l}{2}}\right)$, where $\operatorname{erfc}(x)$ is the complementary error function and $\varepsilon/\sqrt{\sigma_l^2}$ is the normalized threshold. A distinctive feature of $\varkappa$ is that it does not depend on the unknown $a_l$. Substituting $\varepsilon = \vartheta\sqrt{\sigma_l^2/N_l}$ from (10) gives $\varkappa(N_l) = \operatorname{erfc}(\vartheta/\sqrt{2})$, and the confidence probability for the segmental estimate becomes $P(N_l) = 1 - \varkappa(N_l) = 1 - \operatorname{erfc}(\vartheta/\sqrt{2})$. Table 1 gives several values of $\vartheta$, $P$, and $\varkappa$ for likely existing genomic changes ($P > 50\%$). As can be seen, the 1-sigma sense ($\vartheta = 1$) occupies an intermediate position between the 50% probability (even chances) and the 75% probability (probably existing changes). The 2-sigma sense ($\vartheta = 2$) can be treated as typical or almost certainly existing changes and the 3-sigma sense ($\vartheta = 3$) as certainly existing changes.
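The relation $P = 1 - \operatorname{erfc}(\vartheta/\sqrt{2})$ can be evaluated directly to reproduce the entries of Table 1. The short Python sketch below does this with the standard library only; the function name `existence_probability` is an illustrative assumption.

```python
import math


def existence_probability(theta: float) -> tuple[float, float]:
    """Return (P, error probability) for a bound wideness theta.

    The error probability is erfc(theta / sqrt(2)) and P is its complement.
    """
    err = math.erfc(theta / math.sqrt(2.0))
    return 1.0 - err, err


if __name__ == "__main__":
    # Bound wideness values listed in Table 1.
    for theta in (0.6745, 1.0, 1.15035, 1.81191, 1.96, 2.0, 3.0):
        p, err = existence_probability(theta)
        print(f"theta = {theta:<8} P = {100 * p:6.2f} %   error = {100 * err:5.2f} %")
```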

By (2), the jitter left boundary (JLB) $J_l^L$ and the jitter right boundary (JRB) $J_l^R$ can be defined with respect to the $l$th breakpoint $n_l$ similarly to (10) and (11) as

$$J_l^L \cong n_l - k_l^R, \tag{12}$$

$$J_l^R \cong n_l + k_l^L, \tag{13}$$

where

$$k_l^R = \left\lfloor \frac{\nu_l}{\kappa_l} \ln \frac{(1 - d_l)(1 - q_l)}{\varkappa (1 - d_l q_l)} \right\rfloor, \tag{14}$$

$$k_l^L = \left\lfloor \nu_l \kappa_l \ln \frac{(1 - d_l)(1 - q_l)}{\varkappa (1 - d_l q_l)} \right\rfloor \tag{15}$$

correspond to the right (superscript "R") and left (superscript "L") jitter and $\lfloor x \rfloor$ means the maximum integer lower than or equal to $x$. Functions (14) and (15) can easily be obtained by equating (2) to $\varkappa$ and solving for $k_l$. Here, we allow equal confidence intervals for segments and breakpoints.
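A minimal Python sketch of (14) and (15) follows, assuming that $\kappa_l$, $\nu_l$, and the error probability are already known. The function name `jitter_bounds` and the parameter values in the usage part are illustrative assumptions, not values from the measured data.

```python
import math


def jitter_bounds(kappa: float, nu: float, err: float) -> tuple[int, int]:
    """Return (k_left, k_right), the jitter extents of Eqs. (15) and (14).

    Each extent is the largest integer k at which the skew Laplace
    probability (2) is still at least the error probability err.
    """
    d = math.exp(-kappa / nu)
    q = math.exp(-1.0 / (kappa * nu))
    log_term = math.log((1.0 - d) * (1.0 - q) / (err * (1.0 - d * q)))
    k_right = math.floor(nu / kappa * log_term)
    k_left = math.floor(nu * kappa * log_term)
    return k_left, k_right


if __name__ == "__main__":
    err = math.erfc(3.0 / math.sqrt(2.0))  # 3-sigma sense
    print(jitter_bounds(kappa=1.0, nu=1.0, err=err))
```

For the symmetric case $\kappa_l = 1$ both extents coincide, which matches the equal left and right jitters seen for breakpoints with similar segmental SNRs.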

3. Computational algorithm

The confidence UB mask $B_n^U$ and LB mask $B_n^L$ can now be formed by combining (10)–(13). The relevant algorithm is listed in Table 2. Its inputs are the detected CNVs $y_n$, breakpoint locations $\hat{n}_l$, bound wideness $\vartheta$, number $L$ of the breakpoints, and number $M$ of the probes. At the output, it has two confidence masks $B_n^U$ and $B_n^L$.

The first algorithmic block (2–5) computes the segmental statistics $\hat{a}_j$ and $\hat{\sigma}_j$ on intervals between neighboring breakpoints. The second block (6–17) employs (1–5) to compute the right jitter $k_l^R$ and left jitter $k_l^L$. The third block (18–22) finds the jitter boundaries $C_l$ and $D_l$ for the UB and LB masks. The fourth block (23–26) makes corrections to the jitter boundaries in the cases when some boundaries merge or overlap. The fifth block (27–33) skips some points in the case when the UB mask or LB mask turns out to be uniform over several breakpoints. The masks $B_n^U$ and $B_n^L$ finally go to the output. Note that (2) approximates jitter in the breakpoints of CNVs in the lower bound sense. That means that wide jitter boundaries detected by the algorithm may be even wider in practice.
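The first block of the algorithm and the segmental bounds (10) and (11) can be sketched in a few lines of Python. This is a simplified illustration only: it computes the per-segment estimate and its upper and lower bounds for given breakpoints and deliberately omits the jitter steps (blocks two to five of Table 2). The function name `segment_bounds` and the toy data are assumptions made for the example.

```python
import math


def segment_bounds(y, breakpoints, theta):
    """Per-segment (a_hat, upper, lower) bounds in the theta-sigma sense.

    y           -- detected log2 ratios, indexed from 0
    breakpoints -- increasing indices n_1 < ... < n_L; segment j covers
                   y[n_{j-1} : n_j], with n_0 = 0 and the last segment
                   running to the end of y
    theta       -- bound wideness
    """
    edges = [0, *breakpoints, len(y)]
    bounds = []
    for start, stop in zip(edges[:-1], edges[1:]):
        segment = y[start:stop]
        n = len(segment)
        a_hat = sum(segment) / n                            # segment mean
        var = sum((v - a_hat) ** 2 for v in segment) / n    # segment variance
        eps = theta * math.sqrt(var / n)                    # half-width of the bound
        bounds.append((a_hat, a_hat + eps, a_hat - eps))
    return bounds


if __name__ == "__main__":
    toy = [0.0, 2.0, 0.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]  # synthetic data
    for a_hat, upper, lower in segment_bounds(toy, [4], theta=3.0):
        print(f"a_hat = {a_hat:.3f}, UB = {upper:.3f}, LB = {lower:.3f}")
```

Short segments give wide bounds because the half-width scales as $1/\sqrt{N_l}$, which is the effect examined in Section 4.2.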

4. Testing of CNVs measurements by the confidence masks

Our next purpose is to test some detected CNVs with the algorithm given in Table 2. We provide such a test in the 3-sigma sense, suggesting that CNVs exist between the UB and LB masks with a high probability of $P = 99.73\%$. Our studies are based on some HR-CGH array data which are available from the representational oligonucleotide microarray analysis (ROMA) [21]. The breakpoint locations are also taken from [21]. We deliberately select data associated with potentially large jitter and large segmental errors. For clarity, we first compute some characteristics of the detected CNVs and put them into tables. We notice that our segmental estimates found by averaging [22] are in good correspondence with [21].

4.1. Detected data with large jitter

The first database processed is a part of the 7th chromosome in the archive "159A–vs–159D–cut" of ROMA. It is shown to have 14 segments and 13 breakpoints (Figs. 2 and 3). Below we show that, owing to large detection noise, there is a high probability that some breakpoints do not exist.

Observe Fig. 3a and the characteristics collected in Table 3. Here, the only breakpoint whose location can be estimated with high accuracy is $i_1$. Jitter in $i_6$ and $i_7$ is moderate. All other breakpoints have large jitter. It is seen that the UB mask covering the 2nd-to-6th segments is almost uniform. Thus, there is a probability that the 2nd-to-5th breakpoints do not exist. If we follow the LB mask, then the locations of the 2nd-to-4th breakpoints can be predicted, even if with large errors. At least they can be supposed to exist. However, nothing definitive can be said about the 5th breakpoint and one may suppose that it does not exist. It is also hard to distinguish the true location of the 8th breakpoint. In Fig. 3b, $i_{10}$, $i_{12}$, and $i_{13}$ are well detectable owing to large segmental SNRs. The breakpoint $i_9$ has a moderate jitter. In turn, the location of $i_{11}$ is unclear. Moreover, there is a probability that $i_{11}$ does not exist.

Table 3
CNVs characteristics for the 7th sample from "159A–vs–159D–cut" (ROMA); the average resolution is $r = 30$ kb. Statistics, boundaries, and jitter parameters are given for $\log_2$ Ratio. The columns $\hat{a}_j^{UB}$, $\hat{a}_j^{LB}$, $k_{j-1}^L$, and $k_{j-1}^R$ are given in the 3-$\sigma$ sense.

| $j$ | Initial point, b | $\hat{n}_{j-1}$ | $N_j$ | $\hat{a}_j$ | $\hat{\sigma}_j^2$ (×10⁻³) | $\hat{a}_j^{UB}$ | $\hat{a}_j^{LB}$ | $\Delta_{j-1}$ | $\gamma_{j-1}^-$ | $\gamma_{j-1}^+$ | $k_{j-1}^L$ | $k_{j-1}^R$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 113811463 | 3085 | 602 | 1.069 | 6.302 | 1.079 | 1.060 | – | – | – | – | – |
| 2 | i1 = 130613272 | 3687 | 81 | 1.462 | 25.14 | 1.516 | 1.410 | 0.393 | 24.47 | 6.135 | 1 | 2 |
| 3 | i2 = 132446924 | 3768 | 8 | 1.273 | 26.79 | 1.446 | 1.099 | −0.190 | 1.430 | 1.342 | 5 | 5 |
| 4 | i3 = 132633722 | 3776 | 127 | 1.466 | 27.20 | 1.510 | 1.423 | 0.194 | 1.401 | 1.380 | 5 | 5 |
| 5 | i4 = 135934758 | 3903 | 97 | 1.359 | 19.86 | 1.402 | 1.316 | −0.107 | 0.422 | 0.578 | 17 | 4 |
| 6 | i5 = 139176548 | 4000 | 67 | 1.433 | 37.85 | 1.504 | 1.361 | 0.074 | 0.273 | 0.143 | – | – |
| 7 | i6 = 140416906 | 4067 | 12 | 1.879 | 58.37 | 2.088 | 1.670 | 0.446 | 5.260 | 3.411 | 2 | 3 |
| 8 | i7 = 140895309 | 4079 | 57 | 1.371 | 22.75 | 1.430 | 1.311 | −0.508 | 4.428 | 11.36 | 2 | 2 |
| 9 | i8 = 142501212 | 4136 | 123 | 1.481 | 31.77 | 1.530 | 1.433 | 0.111 | 0.538 | 0.385 | 4 | 21 |
| 10 | i9 = 146659997 | 4259 | 84 | 1.976 | 188.7 | 2.118 | 1.833 | 0.494 | 7.686 | 1.294 | 2 | 5 |
| 11 | i10 = 149656964 | 4343 | 62 | 0.636 | 29.40 | 0.701 | 0.570 | −1.340 | 9.512 | 61.05 | 1 | 1 |
| 12 | i11 = 151782674 | 4405 | 16 | 0.799 | 69.63 | 0.997 | 0.601 | 0.163 | 0.902 | 0.381 | 2 | 38 |
| 13 | i12 = 152348882 | 4421 | 80 | 1.999 | 156.2 | 2.131 | 1.866 | 1.200 | 20.68 | 9.219 | 1 | 1 |
| 14 | i13 = 155059985 | 4501 | 48 | 0.742 | 31.54 | 0.819 | 0.665 | −1.257 | 10.11 | 50.06 | 1 | 1 |

Table 4
CNVs characteristics, in the same format as Table 3, for the 2nd sample from "159A–vs–159D–cut" (ROMA).

| $j$ | Initial point, b | $\hat{n}_{j-1}$ | $N_j$ | $\hat{a}_j$ | $\hat{\sigma}_j^2$ (×10⁻³) | $\hat{a}_j^{UB}$ | $\hat{a}_j^{LB}$ | $\Delta_{j-1}$ | $\gamma_{j-1}^-$ | $\gamma_{j-1}^+$ | $k_{j-1}^L$ | $k_{j-1}^R$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.008182 | 0 | 515 | 0.891 | 11.59 | 0.906 | 0.877 | – | – | – | – | – |
| 2 | i1 = 1361572 | 515 | 4 | 0.698 | 0.613 | 0.735 | 0.661 | −0.193 | 3.222 | 60.88 | 2 | 1 |
| 3 | i2 = 13748132 | 519 | 620 | 0.886 | 10.57 | 0.898 | 0.874 | 0.188 | 57.51 | 3.337 | 1 | 2 |
| 4 | i3 = 29908634 | 1139 | 43 | 1.051 | 6.9155 | 1.089 | 1.013 | 0.165 | 2.580 | 3.944 | 4 | 3 |
| 5 | i4 = 31197989 | 1182 | 378 | 0.883 | 10.52 | 0.899 | 0.867 | −0.168 | 4.088 | 2.686 | 3 | 4 |
| 6 | i5 = 43670258 | 1560 | 31 | 1.054 | 7.555 | 1.101 | 1.007 | 0.171 | 2.775 | 3.865 | 4 | 3 |
| 7 | i6 = 44503417 | 1591 | 315 | 0.869 | 11.65 | 0.887 | 0.851 | −0.185 | 4.524 | 2.935 | 3 | 4 |
| 8 | i7 = 55133400 | 1906 | 9 | 0.974 | 15.00 | 1.097 | 0.852 | 0.105 | 0.954 | 0.740 | 5 | 9 |
| 9 | i8 = 55421600 | 1915 | 68 | 0.866 | 9.916 | 0.902 | 0.830 | −0.108 | 0.780 | 1.180 | 10 | 4 |
| 10 | i9 = 57419734 | 1983 | 7 | 1.070 | 6.376 | 1.160 | 0.979 | 0.204 | 4.178 | 6.499 | 3 | 2 |
| 11 | i10 = 57834480 | 1990 | 99 | 0.874 | 9.330 | 0.903 | 0.845 | −0.195 | 5.992 | 4.095 | 2 | 3 |
| 12 | i11 = 60761193 | 2089 | 14 | 1.000 | 8.449 | 1.074 | 0.927 | 0.126 | 1.702 | 1.879 | 5 | 4 |
| 13 | i12 = 61183941 | 2103 | 219 | 0.877 | 10.31 | 0.898 | 0.856 | −0.123 | 1.800 | 1.475 | 4 | 5 |
| 14 | i13 = 68128478 | 2322 | 25 | 1.064 | 4.704 | 1.106 | 1.023 | 0.187 | 3.406 | 7.468 | 3 | 2 |
| 15 | i14 = 68866007 | 2347 | 8 | 0.808 | 5.307 | 0.885 | 0.731 | −0.257 | 13.99 | 12.40 | 1 | 1 |
| 16 | i15 = 69116139 | 2355 | 20 | 1.051 | 3.205 | 1.089 | 1.013 | 0.243 | 11.13 | 18.46 | 1 | 1 |
| 17 | i16 = 69558968 | 2375 | 148 | 0.879 | 11.36 | 0.905 | 0.852 | −0.172 | 9.256 | 2.612 | 2 | 4 |
| 18 | i17 = 73766518 | 2523 | 20 | 1.001 | 11.58 | 1.074 | 0.929 | 0.123 | 1.324 | 1.298 | 5 | 5 |
| 19 | i18 = 74221771 | 2543 | 40 | 0.819 | 14.48 | 0.876 | 0.762 | −0.183 | 2.878 | 2.304 | 3 | 4 |
| 20 | i19 = 75395596 | 2583 | 384 | 0.887 | 9.414 | 0.902 | 0.873 | 0.069 | 0.326 | 0.502 | 46 | 3 |
| 21 | i20 = 88410178 | 2967 | 32 | 1.137 | 11.62 | 1.194 | 1.080 | 0.250 | 6.632 | 5.374 | 2 | 2 |
| 22 | i21 = 93333622 | 2999 | 436 | 1.068 | 6.468 | 1.080 | 1.057 | −0.069 | 0.409 | 0.735 | 32 | 3 |
| 23 | i22 = 109227575 | 3435 | 20 | 0.929 | 6.561 | 0.983 | 0.874 | −0.140 | 3.016 | 2.973 | 3 | 3 |
| 24 | i23 = 109611296 | 3455 | 339 | 1.058 | 7.336 | 1.072 | 1.044 | 0.129 | 2.531 | 2.264 | 4 | 4 |
| 25 | i24 = 119366969 | 3794 | 65 | 1.244 | 7.799 | 1.276 | 1.211 | 0.186 | 4.713 | 4.433 | 3 | 3 |
| 26 | i25 = 121496792 | 3859 | 95 | 0.875 | 10.48 | 0.907 | 0.843 | −0.369 | 17.41 | 12.96 | 1 | 1 |
| 27 | i26 = 124208998 | 3954 | 3 | 1.358 | 4.516 | 1.475 | 1.242 | 0.483 | 22.29 | 51.74 | 1 | 1 |
| 28 | i27 = 124255502 | 3957 | 138 | 0.896 | 13.68 | 0.926 | 0.866 | −0.462 | 47.33 | 15.62 | 1 | 1 |
| 29 | i28 = 129730088 | 4095 | 3 | 1.293 | 10.81 | 1.473 | 1.113 | 0.397 | 11.53 | 14.60 | 1 | 1 |
| 30 | i29 = 130540104 | 4098 | 774 | 0.875 | 10.45 | 0.886 | 0.864 | −0.418 | 16.19 | 16.73 | 1 | 1 |
| 31 | i30 = 153359817 | 4872 | 21 | 0.786 | 8.931 | 0.847 | 0.724 | −0.089 | 0.764 | 0.895 | 8 | 5 |
| 32 | i31 = 153907066 | 4893 | 51 | 0.867 | 9.052 | 0.907 | 0.827 | 0.081 | 0.741 | 0.731 | 6 | 7 |
| 33 | i32 = 155689115 | 4944 | 76 | 1.256 | 8.784 | 1.288 | 1.223 | 0.389 | 16.67 | 17.18 | 1 | 1 |
| 34 | i33 = 157810927 | 5020 | 92 | 1.038 | 4.855 | 1.060 | 1.016 | −0.218 | 5.400 | 9.771 | 2 | 2 |
| 35 | i34 = 161040443 | 5112 | 14 | 1.276 | 8.264 | 1.349 | 1.203 | 0.239 | 11.72 | 6.888 | 2 | 2 |
| 36 | i35 = 161358721 | 5126 | 651 | 1.061 | 6.034 | 1.070 | 1.051 | −0.216 | 5.631 | 7.71273 | 2 | 2 |
| 37 | i36 = 180598006 | 5777 | 63 | 1.268 | 11.25 | 1.308 | 1.228 | 0.208 | 7.147 | 3.832 | 2 | 3 |
| 38 | i37 = 182531303 | 5840 | 122 | 1.064 | 4.356 | 1.081 | 1.046 | −0.205 | 3.766 | 9.625 | 3 | 2 |
| 39 | i38 = 188033882 | 5962 | 7 | 1.442 | 26.54 | 1.626 | 1.257 | 0.378 | 32.81 | 5.385 | 1 | 2 |
| 40 | i39 = 188185029 | 5969 | 923 | 1.058 | 5.672 | 1.065 | 1.051 | −0.384 | 5.543 | 25.94 | 2 | 1 |
| 41 | i40 = 217472857 | 6892 | 22 | 1.321 | 31.500 | 1.435 | 1.208 | 0.262 | 12.21 | 2.199 | 2 | 4 |
| 42 | i41 = 218088420 | 6914 | 796 | 1.049 | 6.9353 | 1.058 | 1.040 | −0.272 | 2.357 | 11.71 | 4 | 2 |

Fig. 4. Measurements and estimates of the 2nd chromosome taken from the archive "159A–vs–159D–cut" of ROMA.

4.2. Detected data with large segmental errors

Another database processed corresponds to the 2nd chromosome in "159A–vs–159D–cut" of ROMA, supposedly having 42 segments and 41 breakpoints. This case (Fig. 4) demonstrates large segmental errors when a segment is represented by only a few probes. In turn, only several breakpoints are accompanied here by large jitter. Some statistics of this chromosome are given in Table 4.

Four specific regions of this chromosome associated with large segmental errors are sketched in Figs. 5 and 6. Even a quick look suggests that in almost all of the cases of short chromosomal changes the segmental errors reach tens of percent. Moreover, some segments cannot be estimated at all with a reasonable error. Thus, there is a probability that such changes do not exist. In fact, errors in the estimates of $a_4$, $a_6$, $a_{14}$, $a_{16}$, $a_{25}$, $a_{33}$, and $a_{37}$ exceed 30%. The situation is even worse with $a_{10}$, $a_{12}$, $a_{15}$, $a_{18}$, $a_{35}$, and $a_{39}$, where the estimation errors reach 40–50%, so such estimates are not very useful. Furthermore, there are two segments, $a_2$ and $a_8$, whose levels cannot be estimated correctly, and the question arises of their existence.

Another specific feature of this chromosome is that the detected part around the breakpoint $i_{19}$ does not contain enough information for experts. As a consequence, neither $a_{19}$ nor $i_{19}$ can be estimated with a reasonably small error. The situation is a bit better with $a_{31}$, $i_{30}$, and $i_{31}$. But there is not enough accuracy here as well, and the predicted change associated with $a_{31}$ may not exist.

Fig. 5. UB and LB masks for the estimates of the CNVs shown in Fig. 4: (a) genomic location from 10 Mb to 45 Mb and (b) genomic location from 50 Mb to 80 Mb. Errors in the estimates of $a_1$, $a_6$, $a_{14}$, and $a_{16}$ exceed 30%. Errors in the estimates of $a_{10}$, $a_{12}$, $a_{15}$, and $a_{18}$ reach 40–50%. There is a probability that changes $a_2$ and $a_8$ do not exist.

Fig. 6. UB and LB masks for the estimates of the CNVs shown in Fig. 4: (a) genomic location from 119 Mb to 126 Mb and (b) genomic location from 150 Mb to 190 Mb. Errors in the estimates of $a_{25}$, $a_{33}$, and $a_{37}$ exceed 30%. Errors in the estimates of $a_{35}$ and $a_{39}$ reach 40–50%.


... be combined with an efficient detector of the breakpoints. Special attention should be paid to the mathematical justification of the probability that the detected breakpoint exists in the confidence ...

Fig. 7. Expansion of the UB and LB masks with increasing probability of existence of genomic changes around two detected breakpoints. The estimated CNVs are shown as a bold dashed line and the detected data are dotted.

. Discussion

.1. Segmental errors

Fig. 2 suggests that an average segmental variance is �2 = 53.24.or white segmental noise, it can be reduced by averaging as �2/Nl

n order to reach the standard deviation of 0.243 (we think that its acceptable) over Nl = 900 probes. It turns out however that onlyˆ1 is observed with N1 = 602 probes and all other segments have

uch smaller Nl. Hence, most of segments will suffer of large errorsnd higher probe resolution will be required. It was shown in [4]hat the CNVs may be about 1 kb in size. In order to obtain smallegmental errors, the probes must thus have a resolution at least00 bp as reported in [7] and even higher that is available usingGS technologies. For the detection resolution of 5–10 kb reported

n [6] and 30 kb in [21], an estimator will not be able to provide aealistic CNVs picture, because some short segments will be lost.

The CNVs estimation error can be considered negligible if it does not exceed a threshold of ±0.01, or 1% of an average change of 1.0 in the $\log_2$ ratio. For the average variance $\sigma^2 = 53.24$ in Fig. 2, the UB and LB suggest that the estimate will stay below this threshold if $N_l = 4791$. However, the maximal segmental length in Fig. 4 is $N_l = 923$ (see $a_{40}$), which corresponds to a threshold of about ±0.023. Other segments require larger, and even much larger, thresholds. So, no genomic change in Fig. 2 can be estimated with 1% error. In the best case, the estimation error exceeds 2.3%.
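The averaging argument in this subsection can be checked numerically. The sketch below assumes white Gaussian segmental noise, so that the standard deviation of a segmental mean over $N_l$ probes is $\sqrt{\sigma^2/N_l}$. It further assumes that the error threshold scales as $1/\sqrt{N_l}$, which is consistent with how the quoted pair ($N_l = 4791$, ±0.01) maps onto $N_l = 923$. The function names are illustrative and do not come from the paper.

```python
import math

SIGMA2 = 53.24  # average segmental variance quoted for Fig. 2


def averaged_std(sigma2: float, n_probes: int) -> float:
    """Standard deviation of a segmental mean over n_probes white-noise samples."""
    return math.sqrt(sigma2 / n_probes)


def scaled_threshold(ref_threshold: float, ref_n: int, n_probes: int) -> float:
    """Rescale a reference error threshold, assuming it varies as 1/sqrt(N_l)."""
    return ref_threshold * math.sqrt(ref_n / n_probes)


# Averaging over 900 probes, as in the text.
print(round(averaged_std(SIGMA2, 900), 3))          # 0.243
# Threshold for the longest segment (923 probes), from the 4791-probe reference.
print(round(scaled_threshold(0.01, 4791, 923), 3))  # 0.023
```

Under the same assumptions, shorter segments give proportionally larger values, which is why most segments cannot meet the 1% threshold.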

5.2. Jitter in the breakpoints

It follows from Tables 3 and 4 that a minimal jitter of ±1 points is achieved with a segmental SNR above 10 (10 dB). However, most frequently we encounter an SNR of about 1 (0 dB), which causes a jitter error of several points. Sometimes, the SNR is small in only one of the segments ($i_{39}$ and $i_{40}$ in Table 4). This does not lead to a substantial difference between the left and right jitters. But if the SNRs fall well below unity in both segments, the jitter may reach tens of points and nothing definitive can be said about the breakpoint location. This is the case of $i_5$ in Table 3 and Fig. 4a. So, an acceptable jitter of ±1 points is not common for HR-CGH array-based probes. Typically, the jitter here reaches several points.
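The SNR levels in this subsection are quoted both as ratios and in decibels. As a small worked conversion, the helper below assumes the segmental SNR is a power ratio, which agrees with the pairs 10 ↔ 10 dB and 1 ↔ 0 dB used in the text. The function name is illustrative.

```python
import math


def snr_to_db(snr: float) -> float:
    """Convert a power-ratio SNR to decibels."""
    return 10.0 * math.log10(snr)


# SNR of 10 and 1 are the two cases discussed; 0.5 illustrates "below unity".
for snr in (10.0, 1.0, 0.5):
    print(f"SNR = {snr}: {snr_to_db(snr):.1f} dB")
```

An SNR below unity therefore corresponds to a negative value in dB, which is the regime where the jitter grows to tens of points.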

5.3. Basic properties of confidence UB and LB masks

Based on the above analysis, we conclude that the confidence UB and LB masks can serve as a powerful tool for medical applications owing to the following basic properties:

• The true CNVs exist between $B_n^U$ and $B_n^L$ with the required probability $P(N_l)$.

• If either $B_n^U$ or $B_n^L$ covering two or more breakpoints turns out to be uniform, then there is a probability of no changes in this region.

• If both $B_n^U$ and $B_n^L$ covering two or more breakpoints turn out to be uniform, then there is a high probability of no changes in this region.

We finally examine the effect of the confidence interval on the UB and LB masks. To illustrate, we consider a part of the detected CNVs (bold dots in Fig. 7) which consists of two breakpoints and is highly contaminated by noise. We then compute the masks using the algorithm (Table 2) for several of the most common probabilities listed in Table 1. Fig. 7 shows that the 1-sigma sense (68.27%) is closely related to probable existence (75%). With such probabilities, jitter in the breakpoints does not exist and segmental changes are estimated at an acceptable error level of about 0.1. If we increase the probability and pass through the certain existence and the 2-sigma sense to the 3-sigma sense (99.73%), then the UB and LB stretch so that the genomic changes cannot be said to exist with acceptable errors. This example demonstrates that a final deduction about the CNVs structure will strongly depend on the confidence interval. Further investigations are thus required in order to specify this interval via the probability $P$ which is most acceptable for medical conclusions.
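The relation between the quoted probabilities and the sigma senses can be made concrete under a Gaussian assumption. The sketch below converts a two-sided confidence probability into the corresponding multiple of the standard deviation, and scales a hypothetical per-segment half-width as $z\,\sigma/\sqrt{N_l}$. This simplified bound ignores breakpoint jitter and is not the algorithm of Table 2; the 95.45% entry is the standard 2-sigma probability, and the function names are illustrative.

```python
from statistics import NormalDist


def sigma_multiple(prob: float) -> float:
    """Return z such that P(|x| <= z*sigma) = prob for a Gaussian x."""
    return NormalDist().inv_cdf(0.5 * (1.0 + prob))


def half_width(prob: float, sigma: float, n_probes: int) -> float:
    """Hypothetical half-width of a segmental bound: z * sigma / sqrt(N_l)."""
    return sigma_multiple(prob) * sigma / n_probes ** 0.5


# 1-sigma, "probable existence" (75%), 2-sigma, and 3-sigma probabilities.
for prob in (0.6827, 0.75, 0.9545, 0.9973):
    print(f"P = {prob:.4f}  z = {sigma_multiple(prob):.2f}")
```

The multiples come out at about 1.00, 1.15, 2.00, and 3.00, which illustrates why the 68.27% and 75% masks nearly coincide while the 99.73% masks are roughly three times wider under this simplified model.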

6. Conclusions

Modern technologies such as NGS allow detecting genome changes with a probe resolution of about 0.8–6 kb in white Gaussian noise having segmental SNRs around unity. Under such conditions, estimates of segmental changes and breakpoint locations are often accompanied by large errors, especially if the genome changes are about 1 kb in length. In view of the large noise, no estimator, even an ideal one, is able to provide jitter-free detection of the breakpoint locations and error-free estimation of segmental changes.

The confidence UB and LB masks proposed in this paper outline a region of the endpoints within which the CNVs estimates exist with a given probability. The masks can thus serve as an auxiliary tool for medical experts to make decisions about CNVs structures. The algorithm designed is able to produce the masks for any confidence interval. Testing some HR-CGH array data in the 3-sigma sense (confidence interval of 99.73%) has revealed errors of 30–50% and larger in many segments. Jitter was also shown to be large in some breakpoints. Moreover, it was indicated that there is a probability that some changes predicted by ROMA may not exist. By reducing the confidence interval, the boundaries inherently squeeze around the predicted changes. However, the masks lose their usefulness when the confidence interval reaches 50%.

Based upon this investigation, we conclude that further work should focus on optimizing the confidence interval in order to meet medical needs. Also, the algorithm should be combined with an efficient detector of the breakpoints. Special attention should be paid to the mathematical justification of the probability that the detected breakpoint exists in the confidence masks. Finally, a more accurate approximation of the jitter distribution needs to be found, as the discrete skew Laplace law does not fit the process well at small SNRs. We are working on this now and expect to present some results in the near future.

References

[1] A.F. Cockburn, M.J. Newkirk, R.A. Firtel, Organization of the RNA genes of Dictyostelium: mapping of the nontranscribed spacer regions, Cell 9 (4) (1976) 605–613, Part 1.

[2] F.S. Collins, E.S. Lander, J. Rogers, R.H. Waterson, Finishing the euchromatic sequence of the human genome, Nature 431 (7011) (2004) 931–945.

[3] R. Redon, S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, M.E. Hurles, Global variation in copy number in the human genome, Nature 444 (2006) 444–454.

[4] P. Stankiewicz, J.R. Lupski, Structural variation in the human genome and its role in disease, Ann. Rev. Med. 61 (2010) 437–455.

[5] S. Yoon, Z. Xuan, V. Makarov, K. Ye, J. Sebat, Sensitive and accurate detection of copy number variants using read depth of coverage, Genome Res. 19 (2009) 1586–1592.

[6] H. Ren, W. Francis, A. Boys, A.C. Chueh, N. Wong, P. La, L.H. Wong, J. Ryan, H.R. Slater, K.H.A. Choo, BAC-based PCR fragment microarray: high-resolution detection of chromosomal deletion and duplication breakpoints, Hum. Mutat. 25 (5) (2005) 476–482.

[7] A.E. Urban, J.O. Korbel, R. Selzer, T. Richmond, A. Hacker, G.V. Popescu, J.F. Cubells, R. Green, B.S. Emanuel, M.B. Gerstein, S.M. Weissman, M. Snyder, High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays, Proc. Natl. Acad. Sci. (PNAS) 103 (12) (2006) 4534–4539.

[8] J.O. Korbel, A.E. Urban, J.P. Affourtit, B. Godwin, F. Grubert, J.F. Simons, P.M. Kim, D. Palejev, N.J. Carriero, L. Du, B.E. Taillon, Z. Chen, A. Tanzer, A.C.E. Saunders, J. Chi, F. Yang, N.P. Carter, M.E. Hurles, S.M. Weissman, T.T. Harkins, M.B. Gerstein, M. Egholm, M. Snyder, Paired-end mapping reveals extensive structural variation in the human genome, Science 318 (5849) (2007) 420–426.

[9] C. Xie, M.T. Tammi, CNV-seq, a new method to detect copy number variation using high-throughput sequencing, BMC Bioinform. 10 (80) (2009) 1–9.

[10] S. Ivakhno, T. Royce, A.J. Cox, D.J. Evers, R.K. Cheetham, S. Tavaré, CNAseg – a novel framework for identification of copy number changes in cancer from second-generation sequencing data, Bioinformatics 26 (24) (2010) 3051–3058.

[11] V. Boeva, A. Zinovyev, K. Bleakley, J.P. Vert, I. Janoueix-Lerosey, O. Delattre, E. Barillot, Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization, Bioinformatics 27 (2) (2011) 268–269.

[12] A. Gusnanto, H.M. Wood, Y. Pawitan, P. Rabbitts, S. Berri, Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data, Bioinformatics 28 (2012) 40–47.

[13] J. Duan, J.G. Zhang, H.W. Deng, Y.P. Wang, Comparative studies of copy number variation detection methods for next-generation sequencing technologies, PLoS ONE 8 (3) (2013) 1–12, e59128.

[14] Y.H. Yang, S. Dudoit, P. Luu, D.M. Lin, V.P. Peng, J. Ngai, T.P. Speed, Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Res. 30 (4) (2002) 1–10.

[15] C. Zong, S. Lu, A.R. Chapman, X.S. Xie, Genome-wide detection of single-nucleotide and copy-number variations of a single human cell, Science 338 (2012) 1622–1626.

[16] J. Munoz-Minjares, J. Cabal-Aragón, Y.S. Shmaliy, Effect of noise on estimates of stepwise changes in genome DNA chromosomal systems, WSEAS Trans. Biol. Biomed. 11 (2014) 52–61.

[17] W.R. Lai, M.D. Johnson, R. Kucherlapati, P.J. Park, Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data, Bioinformatics 21 (19) (2005) 3763–3770.

[18] J. Munoz-Minjares, Y.S. Shmaliy, J. Cabal-Aragón, Confidence limits for genome DNA copy number variations in HR-CGH array measurements, Biomed. Signal Process. Control 10 (2014) 166–173.

[19] F. Picard, S. Robin, M. Lavielle, C. Vaisse, J.J. Daudin, A statistical approach for array CGH data analysis, BMC Bioinform. 6 (27) (2005) 1–14.

[20] J. Munoz-Minjares, J. Cabal-Aragon, Y.S. Shmaliy, Jitter probability in the breakpoints of discrete sparse piecewise-constant signals, in: Proc. 21st European Signal Process. Conf. (EUSIPCO-2013), 2013, pp. 1–5.

[21] R. Lucito, J. Healy, J. Alexander, A. Reiner, D. Esposito, M. Chi, L. Rodgers, A. Brady, J. Sebat, J. Troge, J.A. West, S. Rostan, K.C.Q. Nguyen, S. Powers, K.Q. Ye, A. Olshen, E. Venkatraman, L. Norton, M. Wigler, Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation, Genome Res. 10 (2003) 2291–2305.

[22] J. Munoz-Minjares, J. Cabal-Aragon, Y.S. Shmaliy, Probabilistic bounds for estimates of genome DNA copy number variations using HR-CGH microarrays, in: Proc. 21st European Signal Process. Conf. (EUSIPCO-2013), 2013, pp. 1–5.

[23] J. Wu, K.R. Grzeda, C. Stewart, F. Grubert, A.E. Urban, M.P. Snyder, G.T. Marth, Copy number variation detection from 1000 genomes project exon capture sequencing data, BMC Bioinform. 13 (305) (2012) 1–19.

[24] J.H. Chu, A. Rogers, I. Ionita-Laza, K. Darvishi, R.E. Mills, C. Lee, B.A. Raby, Copy number variation genotyping using family information, BMC Bioinform. 14 (157) (2013) 1–11.

[25] T.J. Kozubowski, S. Inusah, A skew Laplace distribution on integers, Ann. Inst. Stat. Math. 58 (2006) 555–571.