White Paper - Qiagen · White paper: Copy number variant detection Section3.2). Every target is then characterized by the average log-coverage in the case sample and in the baseline

Sample to Insight

White Paper

Copy number variant detection

August 19, 2015

CLC bio, a QIAGEN CompanySilkeborgvej 2 Prismet 8000 Aarhus C DenmarkTelephone: +45 70 22 32 44 Fax: +45 86 20 12 22www.clcbio.com [email protected]

White paper: Copy number variant detection

Abstract

In this white paper, we present the Copy Number Variant (CNV) detection tool, available inthe Biomedical Genomics Workbench from version 2.1 and above. We outline the algorithmimplemented in the tool, and present accuracy benchmarks comparing the tool to state-of-the-art methods. Our results show that the CNV detection tool is capable of highly accurateidentification of copy number variations in a broad range of next-generation sequencing data,achieving a gene-level detection sensitivity of 100% in nearly all our benchmarks. An addedadvantage of the CNV detection tool is that it is an in-built functionality of the BiomedicalGenomics Workbench, so the results can be visualized and interpreted together with theread mappings and the results of other tools. We conclude that the CNV detection tool isa versatile and accurate method for the identification of copy number variants in NGS data,and it has the potential to provide valuable insights in basic medical research, as well astranslational clinical settings.

Contents

1 Introduction 4

1.1 Aims for the CNV detection tool . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Requirements for the CNV detection tool . . . . . . . . . . . . . . . . . . . . . . 4

2 A brief overview of the statistical model 4

3 Detailed Methods 6

3.1 Computing base-level coverages and generation of a robust baseline . . . . . . . 6

3.2 Normalization of coverages and chromosome-level analysis . . . . . . . . . . . . . 6

3.3 Log-ratios and their adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.4 Chromosome segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.5 Estimation of σ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.6 Statistical testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4 Accuracy benchmarks 9

4.1 Data and parameters used in benchmarking . . . . . . . . . . . . . . . . . . . . . 9

4.2 Benchmarking approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.3.1 Benchmark 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.3.2 Benchmark 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.3.3 Benchmark 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5 Conclusions 14

P. 2


6 Acknowledgements 14

A Detailed sample-level results 16

A.1 Benchmark 1: Deep sequencing of inherited disease samples after PCR-basedtarget-enrichment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.2 Benchmark 2: Whole-exome sequencing of melanoma. . . . . . . . . . . . . . . . 18

A.3 Benchmark 3: Clinical cancer genomic profiling. . . . . . . . . . . . . . . . . . . . 19

P. 3


1 IntroductionCopy number variations (CNVs) are a commonclass of structural alterations in the genome,where sections of the genome are deleted orduplicated compared to a reference set. CNVscan range from about 50 bases to severalmegabases in size, and CNVs greater than1 kilobase account for most bases that varyamong human genomes [1]. Some of this vari-ation is benign, but CNVs have also been impli-cated in many diseases, including cancer, inher-ited disorders, cardiovascular disease, autismand schizophrenia [2--4].

Traditionally, CNVs have been identified usingarray technologies, such as comparative ge-nomic hybridization (CGH) [5]. With the rapidlyincreasing adoption of next-generation sequenc-ing (NGS) technologies for the detection of ge-nomic variants, there is an emerging need toidentify CNVs from NGS-based data. Severalapproaches have focused on whole-genome se-quencing (WGS) data (see [6] for a review),but targeted resequencing (TR) remains themost cost-effective strategy to identify disease-causing variants.

The identification of CNVs from TR data carriesa unique set of challenges. The targeted regionsare small, typically only 100-300 bp. The ge-nomic coverage is sparse and non-contiguous,making it highly unlikely that the breakpointsare covered by reads. Approaches developedfor WGS data are therefore unsuitable for de-tecting CNVs from TR data, and utilizing com-parative depth-of-coverage remains the mostwidely used strategy in currently existing CNVdetection tools.

Depth-of-coverage methods are based on count-ing the number of reads that cover each tar-geted region. Due to the variability of targetcoverages, affected by issues such as GC con-tent bias or the mappability of reads, moststate-of-the-art methods require a set of con-trol samples. After normalization to correct forvarying library sizes, statistical inferences aremade on the ratio of the coverages in the casesample and the control samples.

It is difficult to evaluate the performance of CNVtools, particularly because a true "gold stan-dard" reference list does not exist for CNVs [8].The high number of predicted CNVs and thehigh rate of false positive calls by state-of-the-art methods also makes the biological experi-mental validation of CNVs unrealistic [7]. In thisbenchmarking effort, we used a range of biologi-cal datasets with known copy number variationsto evaluate the sensitivity and specificity of theCNV detection tool in the Biomedical GenomicsWorkbench.

1.1 Aims for the CNV detection tool

The purpose of the CNV detection tool in theBiomedical Genomics Workbench is to identifycopy number variants in the following types oftargeted resequencing data:

1. Whole-exome sequencing2. NGS data from hybridization-capture and

amplicon-based gene panels

1.2 Requirements for the CNV detectiontool

The CNV detection tool has the following re-quirements:

1. Mapped NGS reads both for the casesample and one or more control samples

2. The list of non-overlapping targeted re-gions

The statistical models are most accurate whena large number of targets are available.

2 A brief overview of the statisti-cal model

The CNV detection tool is based on an analysisof the coverage depth in the case sample incomparison to a baseline generated from thecontrol samples. The generation of the base-line is described in Section 3.1. The base-levelcoverages of the case and baseline are first nor-malized to account for varying library sizes (see

P. 4


Section 3.2). Every target is then characterizedby the average log-coverage in the case sampleand in the baseline. The log-ratio of these cov-erages is the non-adjusted target-level log-ratio,which is subsequently corrected for coveragebias (see Section 3.3). The resulting quantityis the adjusted target-level log-ratio, which wesimply refer to as the log-ratio, Xi, for target i.

The log-ratios vary from target to target, causedboth by statistical noise and variations due to"true" CNVs. In a simple approach, we canmodel the log-ratio for target i with a normaldistribution:

Xi ∼ N(0, σ2c ) (1)

where the variance, σ2c (a measure for the sta-tistical noise) is a function of coverage c. Toestimate σc, a ‘binning’ approach can be used,as in the CONTRA algorithm [9]. Briefly, tar-gets are binned by their log-coverage, and thestandard deviation is calculated for each bin.A curve is fitted, which is used to produce avalue for the expected variation, σ̂c, given anylog-coverage value (more details are providedin Section 3.5.) A p-value is then computedfor each target using σ̂c for the target’s log-coverage. CNVs are identified as the targetswhose p-values indicate that they are statisticaloutliers compared to the rest of the dataset.

The above procedure works well if many re-gions are targeted, and only few of them areexpected to be affected by CNVs (as in the caseof many inherited diseases, for example). But ifmany targets are affected by CNVs (such as inFigure 1(a) and (b)), then true copy number vari-ations will be incorrectly attributed to statisticalnoise. This leads to a decreased sensitivity tolarge CNVs, as has also been noted by Tan etal. for the CONTRA algorithm [7]. It is there-fore necessary to extend the model to accountfor potential large-scale changes in log-ratioscaused by changes in copy number. In theCNV detection tool, we do this by allowing for anon-zero mean in constant copy-number regionsduring the estimation of the statistical noise.Combining Equation 1 with the sparse normalmean model described in [10], we model thelog-ratios Xi with a normal distribution, where

the mean µR only depends on the copy numberof the region, and the variance σ2c is a functionof target coverage c:

Xi ∼ N(µR, σ2c ) (2)

Here, R is a constant copy-number region withi ∈ R. The main goal of the algorithm in theCNV detection tool is to estimate µR and σcaccurately, and thus separate the large-scalecopy-number changes from the smaller-scalestatistical noise in each copy-number-constantregion.

The estimation of these parameters and thestatistical testing using the combined model iscarried out in a stepwise fashion.

1. We segment each chromosome into con-stant copy-number regions, using theScreening and Ranking Algorithm (SaRA)described in [10]. Here, we use a con-stant estimator σ̂ for the intra-region vari-ability, as we assume that the intra-regionvariability is much smaller than the inter-region variability. Thus, the coverage-dependence of the intra-region varianceis ignored for the purposes of this seg-mentation step only (Section 3.4).

2. For each region, we compute a median-based robust estimate µ̂R for µR. Sub-tracting this estimate from each Xi, weexpect Xi − µ̂R ∼ N(0, σ2c ), where i ∈ R.This is identical in form to the simplemodel described in Equation 1, and thebinning approach of CONTRA can there-fore be used to compute a copy number-adjusted coverage-dependent model forσc (Section 3.5).

3. For each target, we obtain (one-sided)p-values for amplification and deletionagainst a null hypothesis of no change,using the copy-number adjusted estima-tor σ̂c. Finally, we use Fisher’s methodfor combining the target-level p-values toobtain p-values for each region, against ahypothesis of no change (Section 3.6).

P. 5


Figure 1: The adjusted log2-ratios in two different types of samples. The black dots represent the ad-justed log-ratios calculated at each targeted region. The blue and red plateaus are the regions producedby the CNV detection tool after segmentation. The red regions mark the targets that were predicted tobe part of a CNV at a significance level of 0.05 and fold-change cutoff of 1.4. The blue regions were notcalled by the CNV detection tool. The datasets are described in more detail in Section 4.1. (a) Sam-ple M62-4 from Benchmark 1, inherited large-scale genomic alterations. A large part of chromosome8 (chr8:107120037-119294603) is affected by a heterozygous deletion. The sample (NA09888, CorrielInstitute) was probed with QIAGEN amplicon-based CNV panel CNA902Y. (b) Sample from Benchmark3, metastatic cancer. The top and bottom plots are identical, except the scaling on the y-axis. The sampleis affected by a large number of CNVs. The sequencing data are accessible under accession numberSRR948983, originally published in [13]. The target coordinates were obtained from the authors of [13](personal communication).

3 Detailed MethodsIn this section, we describe the steps of the al-gorithm implemented in the CNV detection toolin more details.

3.1 Computing base-level coverages andgeneration of a robust baseline

For each case and control sample, base-levelcoverages are computed by counting the num-ber of reads aligned to every targeted base.Then, a robust baseline for base-level cover-ages is generated from all control samples by

computing the trimmed mean of the coveragesat each base in the control samples, as de-scribed in [9]. In this baseline, each targetednucleotide is associated with a coverage value,which quantifies the overall depth-of-coverageof the nucleotide in the control samples.

3.2 Normalization of coverages andchromosome-level analysis

Both the library size of the case sample, Lcase,and library size of the baseline, Lbaseline, aredefined as the sum of the coverages at all po-sitions. Clearly, the library sizes depend on

P. 6


the sequencing depths of the different sam-ples, and normalization must be done beforecomputing the target-level log-ratios. A simpleapproach is to scale both the case and thebaseline library sizes to their geometric mean,as is done in [9]. Thus, a single scaling value,s, can be used to normalize every raw coveragein the case-baseline pair:

s =

√LcaseLbaseline

Lcase(3)

The nucleotide-level normalized coverage d ateach targeted position in the case and baselineis computed from the un-normalized coveragec:

dcase = ccase · s+ δ (4)

dbaseline =cbaseline

s+ δ (5)

δ is a small offset (equal to 0.5 in the CNVdetection tool) used to prevent zero coveragesin either the case or the baseline.

However, when large parts of the genome areaffected by CNVs, as is the case in Figure 1(a)and (b), then the library size of the case sam-ple will be significantly affected by these CNVs,and the above approach will incorrectly attributethis effect to a difference in sequencing depth.Therefore, in the CNV detection tool, we firstdetect chromosomal coverage outliers (chromo-somes possibly containing large-scale CNVs),and compute Lcase and Lbaseline only using thechromosomes that are found to have "typical"coverages.

We detect chromosomal coverage outliers us-ing linear regression analysis, in the followingsteps.

1. A linear model is used to model the chro-mosomal coverages:

Ccase = mCbaseline + ε (6)

where:

• Ccase is a random variable represent-ing the total coverage at targeted po-sitions on a chromosome in the casesample

• Cbaseline is a random variable rep-resenting the total coverage at tar-geted positions on the same chro-mosome in the baseline• m is a constant multiplier, related

to the different sequencing depth inthe case sample compared to thebaseline• ε is the error, assumed to be Gaus-

sian

2. The parameter m is estimated from thedata by linear regression analysis.

3. For each chromosome k, the studentizedresidual sk is computed using the ob-served chromosomal coverages cbaseline,k

and ccase,k:

sk =ek√

MSE · (1− hk)(7)

where ek = mcbaseline,k − ccase,k is the ob-served residual, hk is the leverage (equalto the kth diagonal entry in the hat ma-trix), and the MSE is the mean squarederror.

4. The studentized residuals sk are T-distributed with 1 degree of freedom. Weuse this distribution to identify outliers at5% significance. "Normal" chromosomesare defined as the non-significant chro-mosomes under this test. The remainingchromosomes are outliers.

In the following sections, we use ‘‘coverage’’ torefer to normalized coverages computed usingEquations 4 and 5, but only using the chromo-somes that are not detected to be outliers.

3.3 Log-ratios and their adjustment

In the CNV detection tool, we define the base-line log-coverage of a target i with length Nas:

φi = log21

N

∑p

dbaseline,p (8)

and the non-adjusted log-ratio for the same tar-get i as:

ri = log2

1N

∑p dcase,p

1N

∑p dbaseline,p

(9)

P. 7


where the sums are over all positions p in thetarget.

Li et al. [9] observed a linear variation of log-ratios with log-coverage, depending on the dif-ferences in library sizes between case and con-trols. They also proposed a linear correctionmodel, which has been implemented in a si-milar way in the CNV detection tool. In theCNV detection tool, a straight line is fitted be-tween the non-adjusted log-ratios and baselinelog-coverages, and this fitted line is subtractedfrom each non-adjusted log-ratio to produce theadjusted log-ratio xi.

xi = ri − a · φi − b (10)

where a and b are the parameters of the fittedline. In the following sections, by "log-ratios"we refer to the adjusted log-ratios.

3.4 Chromosome segmentation

The chromosomes are segmented using themulti-bandwidth SaRA algorithm [10]. A band-width is an integer value that corresponds toa window in which the algorithm searches forbreakpoints. In the CNV detection tool, thebandwidths are determined by the "graininglevel" parameter, and do not depend on thechromosome lengths. Here we give a briefdescription of the SaRA approach.

1. Taking one bandwidth value h, the diag-nostic function is computed for every tar-get i ∈ [1, n] on a chromosome with lengthn. The diagnostic function is the differ-ence of the averages of the adjusted log-ratios near the target i, within a windowof size h:

D(i, h) =1

h

n∑m=1

wm(i)xm (11)

where wm(i) = sgn(m+ 12 − i) for i− h <

m ≤ i+ h and wm(i) = 0 otherwise.2. The set of local maximizers corresponding

to bandwidth h is identified. The target iis an h-local maximizer if

∀i′ ∈]i−h, i+h] : D(i, h) ≥ D(i′, h) (12)

3. Steps 1 and 2 are repeated with the otherbandwidth values. A common set of localmaximizers is computed as the union ofthe sets obtained for the different band-widths. This set contains the potentialbreakpoints to define the segments.

4. Local maximizers are removed from thisset one-by-one, until the Bayesian Infor-mation Criterion (BIC) for the remaininglocal maximizers is minimized. The BICfor a set J̃ of breakpoints is:

BIC(J̃) =n

2· log(σ2) + J̃ log(n) (13)

where n is the total number of targets onthe chromosome, and σ is the weightedaverage of the variance within all regionsproduced by the breakpoints. The valueof the BIC will, in general, decrease ifa breakpoint is removed. In each roundof the optimization, we remove the localmaximiser whose removal leads to theleast decrease in variance. We continueremoving local maximizers until the BICno longer decreases.

5. Once the final set of local maximizers isobtained, the targets in between the localmaximizers are joined into regions.

3.5 Estimation of σ

Once the constant copy-number regions areidentified, we compute the median log-ratio ineach region R and use it as an estimate µ̂R forµR. The median is used instead of the average,because it is a more robust estimator of themean in the cases where some observationsare incorrect. This can for example happen inamplicon-based data, where a single SNP in aprimer region can reduce the coverage of theentire amplicon to zero, despite no true differ-ences in copy number.

Subtracting this estimate from each log-ratioXi inside the region, we define Yi = Xi − µ̂R,which we expect is normally distributed: Yi ∼N(0, σ2c ), where i ∈ R.

The remaining task is to estimate σc as afunction of coverage. Similarly to CONTRA [9],

P. 8


we classify the observations into approximatelyequal-sized bins on the basis of their log-coverage in the baseline. In the CNV detec-tion tool, the number of bins is approximately10, but the precise number of bins is deter-mined dynamically on the basis of the numberof targets in the data. For each bin, we com-pute the median log-coverage, as well as thestandard deviation of the adjusted log-ratiosof the targets in the bin. We estimate thestandard deviation robustly, based on the inter-quartile range. The result of the binning stepis that we have a set of log-coverage-log-ratiovalue pairs. An exponentially decreasing func-tion f(dbaseline) = α · exp(−β · dbaseline), β > 0,is fitted to these points, to produce a conti-nuous model for the expected variation in thelog-ratios for any given log-coverage value (seeillustration in Figure 2).

3.6 Statistical testing

The null hypothesis under the combined modelfor each target is

H0 : Xi ∼ N(µ, σ2) (14)

where: µ = 0, σ = f(dbaseline,i)

Thus, we compute one-sided p-values to testthe alternative hypotheses H1 : µ > 0 (amplifi-cation case) and H1 : µ < 0 (deletion case).

Finally, to test the statistical significance ofeach region, we combine these target-level p-values using Fisher’s method:

χ22k ∼ −2

k∑i=1

ln(pi) (15)

(16)

where pi is the p-value for the ith hypothesistest. If k tests are combined, the test statis-tic χ2k has a chi-squared distribution with 2kdegrees of freedom.

4 Accuracy benchmarks

4.1 Data and parameters used in bench-marking

To assess the accuracy of the CNV detectiontool, we used three markedly different datasets,as detailed below. The datasets were chosen tocover both inherited diseases and cancer, one-copy as well as multiple-copy number variations,a broad size range for genomic alterations, vary-ing sample purities (from 20% to 100%), differ-ent target enrichment technologies (PCR-basedas well as whole exome sequencing), and awide range of sequencing depths (from 30x upto 1500x). All three benchmark datasets havealso been published previously, enabling the di-rect comparison of our method with alternativemethods for CNV predictions.

1. Benchmark 1: Multiplex-PCR enrichmentand deep sequencing of inherited chro-mosomal disease. This PCR-enrichedcustom QIAGEN GeneRead gene paneldataset was originally described in [11],where it was used to benchmark thequandico tool for copy number analysis.The sequencing reads were mapped tothe hg19 assembly and used directly inthe CNV detection tool, without any addi-tional steps. As in [11], the samples weregrouped into three sets: (a) M62 "highcoverage dataset", with a read depth ofapproximately 1500x, (b) M63 "medium-coverage" dataset, with a read depth ofapproximately 650x, and (c) M117 "val-idation" dataset, with a read depth ofapproximately 1000x. In our case, thealgorithm did not require a training step,so all three datasets were used for vali-dation. The M62 and M63 datasets weregenerated using the same custom-madegene panel (CNA902Y), and M117 wasgenerated using a second custom panel(NGHS-991Y). The following samples wereused as controls for the M62 and M63datasets: NA12878 and NA19219. Thefollowing samples were used as controlsfor the M117 dataset: NA12878 and

P. 9


Figure 2: Illustration of the binning procedure to model variation as a function of coverage. The figureswere produced in the Biomedical Genomics Workbench using the CNV detection tool. In the top figure,the adjusted log-ratio of each target is plotted against its log-coverage in the baseline. The log-coveragesare centered around 0 for all coverages, with the greatest variation at the lowest coverages. In the bottomfigure, each blue dot is derived from a "bin" containing a set of targets. For each bin, the mean of thelog-ratios is plotted against the mean of the log-coverages in the baseline. A curve is then fitted (redcrosses), to give a continuous model for the expected variation in log-ratios for any given log-coverage.

NA19240. As only female controls wereavailable, targets on chromosomes X andY were ignored in both the prediction andin the benchmarking counts. The CNVdetection tool was run with default param-eters, except the "Low coverage cutoff"parameter, which was set to 150 for theM62 dataset, 65 for the M63 dataset,and 100 for the M117 dataset, corre-sponding to approximately a tenth of theaverage read depth in each case.

2. Benchmark 2: Whole-exome sequencing

of melanoma. This exome-sequencingdataset was originally described in [12],where it was used to benchmark theExcavator tool for copy number analy-sis. In that study, the samples werealso independently profiled for CNVs onthe Affymetrix 250K SNP Array platform.The sequencing reads were downloadedfrom the Sequence Read Archive underaccession ERP001844, mapped to thehg19 assembly, and used directly in theCNV detection tool, without any additional

P. 10


steps. The following samples were usedas the set of controls for all predictions onthis dataset: ERR174237, ERR174238,ERR174239, ERR174240, ERR174241,and ERR174242. As the target file, weused the Agilent SureSelect S0274956definitions. The CNV detection tool wasrun with default parameters, except the"Low coverage cutoff" parameter, whichwas set to 10 due to the low coverage(approximately 30x) of the samples, andthe "Minimum fold-change cutoff" param-eter, where runs were made both with avalue of 1.4 and 1.2.

3. Benchmark 3: Clinical cancer genomicprofiling. This dataset was originallydescribed in [13], where it was usedto benchmark a proprietary pipeline forcopy number detection. The sequen-cing reads were downloaded from theSequence Read Archive under accessionSRP028580. The target file was ob-tained from the authors of [13] (personalcommunication). The sequencing readswere mapped to the hg19 assembly andused directly in the CNV detection tool,without any additional steps. The fol-lowing samples were used as the setof controls for all predictions on thisdataset: SRR948995, SRR948996 andSRR948997. As only female controlswere available, targets on chromosomesX and Y were ignored in both the predic-tion and in the benchmarking counts. TheCNV detection toolwas run with default pa-rameters, except the "Low coverage cut-off" parameter, which was set to 150 forall the samples, and the "Graining level"parameter, which was set to "Fine" (asCNVs spanning only a few targets wereexpected).

4.2 Benchmarking approach

We used the CNV detection tool to produce a"region-level" CNV track for each sample. Thistrack contains the CNV regions that passed thep-value and fold-change cutoff filtering criteria.

To evaluate the accuracy of each "region-level"CNV prediction result on the target-level, weused the original target track that was usedto produce the prediction, and classified eachtarget into one of the following categories: truepositive (TP), true negative (TN), false positive(FP) and false negative (FN). A TP target wasdefined as one that fulfilled all of the followingcriteria:

• The target overlaps with an expected CNVregion.• The target overlaps with a CNV region pro-

duced by the CNV detection tool.• The direction of change (gain or loss) pre-

dicted by the CNV detection tool is thesame as the expected direction of change.The magnitude of the predicted changewas not considered in these benchmarks.

A FP target overlaps with a predicted CNV re-gion, but not with an expected CNV region. A FNtarget overlaps with an expected CNV region,but not with a predicted CNV region. Lastly, aTN target is not included either in the expectedor in the predicted CNV regions.

We used the following measures to evaluatethe accuracy of our predictions:

• Sensitivity: Sensitivity is defined as theTPR (true positive rate), i.e. the frac-tion of CNVs that were correctly called:TPR = TP/(TP + FN)• Specificity: Specificity is defined as the

TNR (true negative rate), i.e. the fractionof non-CNVs that were correctly not called:TNR = TN/(TN + FP)• NPV: The NPV (negative predictive value)

is the fraction of the predicted negative(non-CNV) calls that were correct, andis thus another measure of sensitivity:NPV = TN/(TN + FN)• PPV: The PPV (positive predictive value)

is the fraction of the predicted posi-tive (CNV) calls that were correct, andis thus another measure of specificity:PPV = TP/(TP + FP)

P. 11


To evaluate the gene-level accuracy, similarclassifications were used, but the counting wasdone on the basis of a gene track, where thetarget regions aimed at the same gene werejoined. This reflects the currently most com-mon use case, where the gene-level predictionsare more relevant, and the exact breakpointsare of secondary importance. Furthermore, italso eliminates possible artifacts in the resultsresulting from "edge effects" due to uncertainCNV breakpoints. To date, no absolute gold-standard dataset exists for benchmarking CNVtools, and there is usually some uncertaintyassociated with the precise CNV breakpoint lo-cations even in the reference datasets. Wenote also that the CNV detection tool was notdesigned to pinpoint breakpoints inside targetregions.

All benchmarks were automated. Target-levelcomparisons were done in Benchmarks 1 and2, and gene-level comparisons were done inBenchmarks 1 and 3. The detailed sample-levelresults are provided in Appendix A.

4.3 Results and Discussion

4.3.1 Benchmark 1

The samples in Benchmark 1 were enrichedusing a multiplex PCR strategy, where se-quence variations can lead to different PCRenrichment efficiencies, and variants (particu-larly short insertions or deletions) may lead to areduced sequencing depth without the presenceof CNVs [11]. Our results can be found in Sec-tion A.1. On the gene-level, again we observeda sensitivity of 100% (all affected genes weredetected), and a specificity of 99%, when mea-sured using the true negative rate. The target-level sensitivity was also high at 95%, with aspecificity of 100% (rounded to the nearest per-cent). We conclude, therefore, that accuracyof the CNV detection tool was very high on thisdataset, with both sensitivities and specificitiesclose to 100%.

Both the gene-level and the target-level PPVswere high in this benchmark (86% and 96%,respectively), as the segmentation was carried

out on a large scale, due to the "Graining level"parameter being set to "Coarse". The PPV canbe increased significantly by filtering the result-ing CNV calls by the number of targets includedin the CNV region.

Interestingly, the FN and FP calls clustered in afew samples. On closer inspection (data avail-able on request), we found that several of thesame samples were also affected by FN and FPcalls in the independent method, quandico [11],which was benchmarked using the same data.This indicates the presence of experimental er-rors, or that the reference annotations may havebeen slightly inaccurate for these regions.

Neither the sensitivity of the predictions northe specificity was correlated with sequencingdepth in these benchmarks. This is expected,because all samples were deeply sequenced,and the most important errors affecting accu-racy were most likely to be systematic (ratherthan random) errors, which cannot be reducedby increasing sequencing depth.

4.3.2 Benchmark 2

Benchmark 2 was a whole-exome sequencingdataset, targeting a very large number of re-gions (over 170,000 targets) at a low depth ofcoverage (approximately 30x). Furthermore, thesamples were derived from melanoma cell linesaffected by large-scale genomic alterations.Due to the whole-exome nature of this dataset,we did not carry out gene-level benchmarks onthis data. Our results can be found in SectionA.2.

Even though this dataset has been previouslyused in benchmarking the Excavator tool [12],no known "true positive" list of CNVs exists forit. In [12], genomic SNP array profiling wasperformed on the same samples, producing anindependent list of CNVs based on non-NGStechnology. In the same work, it was foundthat the NGS-based analysis was significantlymore sensitive than the SNP-array technology,particularly for the detection of small CNVs. Ourresults for this dataset generally support thesame conclusion.

P. 12


We compared our prediction results both withthe predictions of Excavator, and the SNP arrayprofiling results reported in [12]. As Excavatorreports only 5 copy number states, whereasour tool reports a continuum of fold-changescompared to the normal, we ran the CNV de-tection tool both with two fold-change cutoffvalues: 1.4, to increase specificity, and 1.2,to increase sensitivity. As expected, we foundthat there was a very good correlation betweenthe predictions of our tool and Excavator. Ourtool called 95% of the targets that were calledby Excavator at a fold-change cutoff of 1.4, andthis increased to 99% when the fold-changecutoff was reduced to 1.2. Furthermore, 98%of the targets that were called using the SNParray technology were also called by the CNVdetection tool, confirming the high sensitivitieswe have observed with the other benchmarkingdatasets.

The specificity of the calls is more difficult toevaluate, particularly because it is not prac-tically feasible to verify the many calls madeby either Excavator or the CNV detection tool.However, we found a great degree of overlapbetween the predictions of the two tools: withthe fold-change cutoff set to 1.4, 92% of thetargets called by our tool were also called byExcavator, and 98% of the targets not calledby our tool were not called by Excavator, ei-ther. When the fold-change cutoff was reducedto 1.2, the specificities compared to the Ex-cavator calls reduced (PPV of 47%). This wasexpected, because Excavator only reports het-erozygous (one-copy) or homozygous (multiple-copy) changes, corresponding to a minimumfold-change of 1.5 in the case of amplifications,and it is thus not sensitive to fold-changes ofsmaller magnitudes. The CNV detection tool,in principle, is capable of predicting significantfold-changes of any magnitude, which may oc-cur in large numbers in cancer-affected samplesdue to tumor heterogeneity.

As was also observed for Excavator [12], thespecificity of the CNV detection tool was nothigh when evaluated against the SNP array re-sults. With the fold-change cutoff set to 1.4,

only 34% of the targets predicted to be affectedby the CNV detection tool were also called in theSNP array approach. In comparison, 35% of thetargets predicted by Excavator were also calledin the SNP array approach (data not shown).With the fold-change cutoff reduced to 1.2, theagreement reduced even further to 17%, indicat-ing that fewer than 1 in 5 targets called by theCNV detection tool were also among the SNParray calls. However, as discussed in [12], thelow degree of specificity of the NGS-based meth-ods compared to the SNP array-based methodis more likely to be a result of higher sensitivityrather than a sign of lower specificity in theNGS-based methods.

Overall, as Excavator and the CNV detection toolare based on very different algorithms, such adegree of agreement between these two toolsgreatly increases our confidence that the sensi-tivity and the specificity of both tools are indeedvery high for whole-exome sequencing data.

4.3.3 Benchmark 3

This clinical sequencing dataset was generatedby Foundation Medicine, Inc., and recently pub-lished as part of a larger study describing a clin-ical pipeline based on NGS [13]. This titrationdataset enables the evaluation of the sensitivityof CNV prediction in cancer samples at tumorpurities of less than 100%. As only gene-levelresults were provided in that study, and target-level resolution were not available, we also onlycarried out gene-level analysis in this case. Ourresults can be found in Section A.3.

The evaluation of the sensitivity and specificityof predictions on this dataset was challengingdue to several factors. Firstly, the CNVs werenot verified using a non-NGS technology, andthe published study [13] only reported homozy-gous deletions or amplifications of ≥ 6 copies.(This corresponds to an observed fold-change of3 for a case sample with 100% sample purity,or a fold-change of just 1.4 in a sample with20% sample purity). Furthermore, the specificityof the predictions was not evaluated at all.

Nevertheless, the CNV detection tool was capa-

P. 13


ble of identifying every affected gene reportedby [13] when the sample purity was 50% orgreater, and the sensitivity of detection wasonly reduced below 96% when the sample pu-rity was lower than 30%. This suggests thatthe CNV detection tool is highly sensitive inthe gene-level detection of CNVs in this clin-ical sequencing dataset, even in the case oflow-purity samples. (In a separate experiment,when the fold-change cutoff was reduced to 1.2,a sensitivity of 100% was obtained by the CNVdetection tool even in the lowest-purity sample- data not shown.)

The fraction of called targets also called by thecited study is generally rather low, particularlywhen evaluated using the PPV metric. However,as in the case of Benchmark 2, we believethis is due to an increased sensitivity in ourmethod, especially at higher sample purities:indeed, at a sample purity of 100%, the CNVdetection tool with a fold-change cutoff of 1.4 issensitive enough to call a heterozygous amplifi-cation (three copies), in comparison to the six ormore copies that are required in the publishedstudy [13].

5 ConclusionsThe CNV detection tool in the Biomedical Ge-nomics Workbench is capable of accurate iden-tification of copy number variations in a broadrange of next-generation sequencing data. Innearly all of our benchmarks, genes which wereaffected by copy number variations were de-tected with a 100% sensitivity, even at reducedsample purities. The target-level sensitivitieswere similarly over 95% under most conditions.Our benchmarks using samples affected by in-herited diseases indicate that the specificity ofthe CNV detection tool also approaches 100%.The specificity of the tool is more challengingto evaluate on cancer datasets, particularly dueto the lack of gold standard datasets. However,an analysis of the agreement between our tooland other methods shows that the specificity ofthe CNV detection tool is comparable to otherstate-of-the-art tools, such as Excavator [12].

The CNV detection tool has several advantagesover other tools currently available for the pre-diction of CNVs. Firstly, it reports fold-changeson a continuum, enabling the prediction of CNVsat less than 100% sample purity, and placesno fundamental restriction on the predictableeffect size. Secondly, it includes an in-built seg-mentation step, enabling the determination ofa minimum size for the detected CNVs. Impor-tantly, as the CNV detection tool is an in-builtfunctionality of the Biomedical Genomics Work-bench, the results of the predictions can beeasily visualized alongside the read mappingsand the results of other tools in the GenomeBrowser View.

The main limitation of the CNV detection tool isthat it requires the presence of unaffected tar-gets for optimal performance, as it’s based ona statistical method where an equal sequencingdepth is not required in the case and controlsamples. The normalization procedure modelsthe differences in sequencing depth betweenthe case and control samples, and the CNVdetection tool depends on this model to recog-nize biological CNVs as coverage outliers. Thismethod works most effectively when a largenumber of unaffected targets is available, soan accurate model can be computed for thesequencing depth. For the same reason, paneldesign is of high importance, and the tool is notsuitable for the analysis of datasets where onlya single gene was sequenced.

In conclusion, the CNV detection tool is a versa-tile and accurate method for the identificationof copy number variants in NGS data, and ithas the potential to provide valuable insights inbasic medical research, as well as translationalclinical settings.

6 AcknowledgementsWe would like to thank R. Yelensky from (Foun-dation Medicine, Inc.) for sharing the targetfile for Benchmark 3 with us, and our QIA-GEN colleague F. Reinecke for the data used inBenchmark 2 and for useful conversations.

P. 14


References[1] Conrad DF et al. Origins and functional impact

of copy number variation in the human genome.Nature. 2010, 464(7289):704-12.

[2] Ni X et al. Reproducible copy number variationpatterns among single circulating tumor cells oflung cancer patients. Proceedings of the Na-tional Academy of Sciences of the United Statesof America. 2013, 110(52):21083-8.

[3] Girirajan S, Campbell CD, and Eichler EE. Humancopy number variation and complex genetic dis-ease. Annual Review of Genetics. 2011, 45:203-226.

[4] Zack TI et al. Pan-cancer patterns of somaticcopy number alteration. Nature Genetics. 2013,45(10):1134-1140.

[5] Carter NP. Methods and strategies for analyzingcopy number variation using DNA microarrays.Nature Genetics. 2007, 39:S16-S21

[6] Zhao M et al. Computational tools for copynumber variation (CNV) detection using next-generation sequencing data: features and per-spectives. BMC Bioinformatics. 2013, 14(Suppl11):S1.

[7] Tan R et al. An evaluation of copy number vari-ation detection tools from whole-exome sequenc-

ing data. Human Mutation. 2014, 35(7):899-907.

[8] Teo SM, et al. Statistical challenges associatedwith detecting copy number variations with next-generation sequencing. Bioinformatics. 2012,28(21):2711-2718.

[9] Li et al. CONTRA: copy number analysis fortargeted resequencing, Bioinformatics. 2012,28(10):1307-1313.

[10] Niu YS and Zhang H. The screening and rankingalgorithm to detect DNA copy number variations.,Annals of Applied Statistics. 2012, 6(3): 1306-1326.

[11] Reinecke F, Satya RV and DiCarlo J. Quantita-tive analysis of differences in copy numbers us-ing read depth obtained from PCR-enriched sam-ples and controls., BMC Bioinformatics. 2015,16:17.

[12] Magi A et al. EXCAVATOR: detecting copy num-ber variants from whole-exome sequencing data.,Genome Biology. 2013, 14(10):R120.

[13] Frampton GM et al., Development and valida-tion of a clinical cancer genomic profiling testbased on massively parallel DNA sequencing.,Nature Biotechnology. 2013, 31:1023-1031.

P. 15


A Detailed sample-level results.

A.1 Benchmark 1: Deep sequencing of inherited disease samples after PCR-basedtarget-enrichment

Benchmark 1: gene-level resultsSample TP TN FP FN Sensitivity Specificity NPV PPV

M62-1 (NA01201) 4 36 0 0 1.00 1.00 1.00 1.00M62-3 (NA05067) 2 38 0 0 1.00 1.00 1.00 1.00M62-4 (NA09888) 5 33 2 0 1.00 0.94 1.00 0.71M62-5 (NA11672) 3 37 0 0 1.00 1.00 1.00 1.00M62-6 (NA12606) 3 36 1 0 1.00 0.97 1.00 0.75M62-8 (NA13783) 4 36 0 0 1.00 1.00 1.00 1.00M62-9 (NA14164) 2 37 1 0 1.00 0.97 1.00 0.67M62-10 (NA20022) 2 38 0 0 1.00 1.00 1.00 1.00Total, M62 25 291 4 0 1.00 0.99 1.00 0.86



All samples 85 961 14 0 1.00 0.99 1.00 0.86

P. 16


Benchmark 1: target-level resultsSample TP TN FP FN Sensitivity Specificity NPV PPV




All samples 1549 16003 80 80 0.95 1.00 1.00 0.96

P. 17


A.2 Benchmark 2: Whole-exome sequencing of melanoma.

Benchmark 2: target-level resultsFold-change cutoff: 1.4, overlap with Excavator resultsSample TP TN FP FN Sensitivity Specificity NPV PPV

Me01 (ERR174231) 9268 154120 7707 503 0.95 0.95 1.00 0.55Me02 (ERR174232) 48275 119777 788 2758 0.95 0.99 0.98 0.98Me04 (ERR174233) 36752 130974 2484 1388 0.96 0.98 0.99 0.94Me05 (ERR174234) 56278 109576 1099 4645 0.92 0.99 0.96 0.98Me08 (ERR174235) 49382 118002 932 3282 0.94 0.99 0.97 0.98Me12 (ERR174236) 22901 141723 6606 368 0.98 0.96 1.00 0.78Total 222856 774172 19616 12944 0.95 0.98 0.98 0.92

Fold-change cutoff: 1.4, overlap with SNP array resultsSample TP TN FP FN Sensitivity Specificity NPV PPV


Fold-change cutoff: 1.2, overlap with Excavator resultsSample TP TN FP FN Sensitivity Specificity NPV PPV


Fold-change cutoff: 1.2, overlap with SNP array resultsSample TP TN FP FN Sensitivity Specificity NPV PPV


P. 18


A.3 Benchmark 3: Clinical cancer genomic profiling.

Benchmark 3: gene-level results, ≥30% tumor purity100% tumor puritySample TP TN FP FN Sensitivity Specificity NPV PPV

SRR948953 9 189 74 0 1.00 0.72 1.00 0.11SRR948959 2 187 83 0 1.00 0.69 1.00 0.02SRR948965 2 209 61 0 1.00 0.77 1.00 0.03SRR948971 1 215 56 0 1.00 0.79 1.00 0.02SRR948977 5 225 42 0 1.00 0.84 1.00 0.11SRR948983 4 237 31 0 1.00 0.88 1.00 0.11SRR948989 5 187 80 0 1.00 0.70 1.00 0.06Total, 100% tumor 28 1449 427 0 1.00 0.77 1.00 0.0675% tumor puritySample TP TN FP FN Sensitivity Specificity NPV PPV




SRR948957 8 256 7 1 0.89 0.97 1.00 0.53SRR948963 2 257 13 0 1.00 0.95 1.00 0.13SRR948969 2 262 8 0 1.00 0.97 1.00 0.20SRR948975 1 271 0 0 1.00 1.00 1.00 1.00SRR948981 5 264 3 0 1.00 0.99 1.00 0.62SRR948987 4 227 41 0 1.00 0.85 1.00 0.09SRR948993 5 264 3 0 1.00 0.99 1.00 0.62Total, 30% tumor 27 1801 75 1 0.96 0.96 1.00 0.26Total, ≥30% tumor 138 8334 1046 2 0.99 0.89 1.00 0.12

P. 19


Benchmark 3: gene-level results, <30% tumor purity20% tumor puritySample TP TN FP FN Sensitivity Specificity NPV PPV

SRR948958 8 256 7 1 0.89 0.97 1.00 0.53SRR948964 1 263 7 1 0.50 0.97 1.00 0.12SRR948970 2 269 1 0 1.00 1.00 1.00 0.67SRR948976 1 270 1 0 1.00 1.00 1.00 0.50SRR948982 2 266 1 3 0.40 1.00 0.99 0.67SRR948988 4 254 14 0 1.00 0.95 1.00 0.22SRR948994 0 267 0 5 0.00 1.00 0.98 NATotal, 20% tumor 18 1845 31 10 0.64 0.98 0.99 0.37

Total, <30% tumor 18 1845 31 10 0.64 0.98 0.99 0.37

P. 20

Documents

White Paper - Qiagen · White paper: Copy number variant detection Section3.2). Every target is then characterized by the average log-coverage in the case sample and in the baseline