Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression

John Wiedenhoeft, Eric Brugel, Alexander Schliep
Department of Computer Science, Rutgers University
{john.wiedenhoeft, schliep}@cs.rutgers.edu

July 31, 2015

Abstract

By combining Haar wavelets with Bayesian Hidden Markov Models, we improve detection of genomic copy number variants (CNV) in array CGH experiments compared to the state of the art, including standard Gibbs sampling. At the same time, we achieve drastically reduced running times, as the method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://bioinformatics.rutgers.edu/Software/HaMMLET/. The web supplement is at http://bioinformatics.rutgers.edu/Supplements/HaMMLET/.

Background

The human genome shows remarkable plasticity, leading to significant copy number variations (CNV) within the human population [1]. They contribute to differences in phenotype [2-4], ranging from benign variation over disease susceptibility to inherited or somatic diseases [5], including neuropsychiatric disorders [6-8] and cancer [9, 10]. Separating common from rare variants is important in the study of genetic diseases [5, 11, 12], and while the experimental platforms have matured, interpretation and assessment of pathogenic significance remains a challenge [13].

Computationally, CNV detection is a segmentation problem, in which consecutive stretches of the genome are to be labeled by their copy number. Along with a variety of other methods [14-40], Hidden Markov Models (HMM) [41] play a central role [42-52], as they directly model the separate layers of observed measurements, such as log-ratios in array CGH, and their corresponding latent copy number (CN) states, as well as the underlying linear structure of segments. As statistical models, they depend on a large number of parameters, which have to be either provided a priori by the user or inferred from the data. Classic frequentist maximum likelihood (ML) techniques like Baum-Welch [53, 54] are not guaranteed to be globally optimal, i.e. they can converge to the wrong parameter values, which can limit the accuracy of the segmentation. Furthermore, the Viterbi algorithm [55] only yields a single maximum a posteriori (MAP) segmentation given a parameter estimate [56]. Failure to consider the full set of possible parameters precludes alternative interpretations of the data, and makes it very difficult to derive p-values or confidence intervals. Furthermore, these frequentist techniques have come under increased scrutiny in the scientific community.

Bayesian inference techniques for HMMs, in particular Forward-Backward Gibbs sampling [57, 58], provide an alternative for CNV detection as well [59-61]. Most importantly, they yield a complete probability distribution of copy numbers for each observation. As they are sampling-based, they are computationally expensive, which is problematic especially for high-resolution data. Though they are guaranteed to converge to the correct values under very mild assumptions, they tend to do so slowly, which can lead to oversegmentation and mislabeling if the sampler is stopped prematurely. Another issue in practice is the requirement to specify hyperparameters for the prior distributions. Despite the theoretical advantage of making the inductive bias more explicit, this can be a major source of annoyance for the user. It is also hard to justify any choice of hyperparameters when insufficient domain knowledge is available.

Recent work of our group [62] has focused on accelerating Forward-Backward Gibbs sampling through the introduction of compressed HMMs and approximate sampling. For the first time, Bayesian inference could be performed at running times on par with classic maximum likelihood approaches. It was based on a greedy spatial clustering heuristic, which yielded a static compression of the data into blocks, and block-wise sampling. Despite its success, several important issues remain to be addressed. The blocks are fixed throughout the sampling and impose a structure that turns out to be too rigid in the presence of variances differing between

bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.



[Figure 1 shows three panels, plotting log-ratios against genome position: (a) standard FB Gibbs sampling with per-position latent variables; (b) wavelet compression with per-block latent variables and sufficient statistics; (c) the FB Gibbs sampling loop, in which the current parameters determine the wavelet blocks and sufficient statistics, shown for small and large minimum variance.]

Figure 1: Overview of HaMMLET. Instead of individual computations per observation (panel a), Forward-Backward Gibbs sampling is performed on a compressed version of the data, using sufficient statistics for block-wise computations (panel b) to accelerate inference in Bayesian Hidden Markov Models. During the sampling (panel c), parameters and copy number sequences are sampled iteratively. During each iteration, the sampled variances determine which coefficients of the data's Haar wavelet transform are dynamically set to zero. This controls potential break points at finer or coarser resolution, or, equivalently, defines blocks of variable number and size (panel c, bottom). Our approach thus yields a dynamic, adaptive compression scheme which greatly improves speed of convergence, accuracy, and running times.


CN states. The clustering heuristic relies on empirically derived parameters not supported by a theoretical analysis, which can lead to suboptimal clustering or overfitting. Also, the method cannot easily be generalized to multivariate data. Lastly, the implementation was primarily aimed at comparative analysis between the frequentist and Bayesian approach, as opposed to overall speed.

To address these issues and make Bayesian CNV inference feasible even on a laptop, we propose the combination of HMMs with another popular signal processing technology. Haar wavelets have previously been used in CNV detection [63], mostly as a preprocessing tool for statistical downstream applications [24-28] or simply as a visual aid in GUI applications [17, 64]. A combination of smoothing and segmentation has been suggested as likely to improve results [65], and here we show that this is indeed the case. The wavelets provide a theoretical foundation for a better dynamic compression scheme, for faster convergence and accelerated Bayesian sampling. We improve simultaneously upon the typically conflicting goals of accuracy and speed, because the wavelets allow summary treatment of "easy" CN calls in segments and focus computational effort on the "difficult" CN calls, dynamically and adaptively. This is in contrast to other computationally efficient tools, which often simplify the statistical model or use heuristics. The required data structure can be computed efficiently, incurs minimal overhead, and has a straightforward generalization to multivariate data. We further show how the wavelet transform yields a natural way to set hyperparameters automatically, with little input from the user.
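The idea of wavelet-based block creation can be sketched as follows: zeroing all Haar detail coefficients below a variance-dependent threshold yields a piecewise-constant reconstruction whose maximal constant runs define the blocks. This is only an illustrative sketch under simplifying assumptions; the universal threshold sqrt(2*sigma2*log n) and the power-of-two length restriction are our choices for the sketch, not necessarily those of HaMMLET's data structure.

```python
import numpy as np

def haar_blocks(data, sigma2):
    """Sketch: derive a block structure by zeroing small Haar wavelet
    detail coefficients. Larger sigma2 zeroes more coefficients, giving
    fewer, coarser blocks; smaller sigma2 gives finer blocks."""
    n = len(data)
    assert n & (n - 1) == 0, "length must be a power of two for this sketch"
    # Forward Haar transform (pyramid algorithm, orthonormal scaling).
    coeffs = []
    approx = np.asarray(data, dtype=float)
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        coeffs.append((even - odd) / np.sqrt(2))  # detail coefficients
        approx = (even + odd) / np.sqrt(2)        # approximation
    threshold = np.sqrt(2.0 * sigma2 * np.log(n))
    # Zero small details, then invert the transform.
    rec = approx
    for detail in reversed(coeffs):
        detail = np.where(np.abs(detail) > threshold, detail, 0.0)
        even = (rec + detail) / np.sqrt(2)
        odd = (rec - detail) / np.sqrt(2)
        rec = np.empty(2 * len(rec))
        rec[0::2], rec[1::2] = even, odd
    # Maximal runs of equal reconstructed values define the blocks.
    boundaries = np.flatnonzero(np.diff(rec)) + 1
    return np.split(np.arange(n), boundaries)
```

With a step signal, a small noise variance exposes the break point as a block boundary, while a large variance merges everything into a single block, mirroring the variance-dependent behavior shown in Figure 1.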

We implemented our method in a highly optimized end-user software called HaMMLET. Aside from achieving an acceleration of up to two orders of magnitude, it exhibits significantly improved convergence behavior, has excellent precision and recall, and provides Bayesian inference within seconds even for large data sets. The accuracy and speed of HaMMLET also make it an excellent choice for routine diagnostic use and large-scale re-analysis of legacy data.

Results and discussion

Simulated aCGH data

A previous survey [65] of eleven CNV calling methods for aCGH has established that segmentation-focused methods, such as DNAcopy [32, 33], an implementation of circular binary segmentation (CBS), as well as CGHseg [35], perform consistently well. DNAcopy performs a number of t-tests to detect break-point candidates. The result is typically over-segmented and requires a merging step in post-processing, especially to reduce the number of segment means. To this end, MergeLevels was introduced by [66]. They compare the combination DNAcopy+MergeLevels to their own HMM implementation [45] as well as GLAD [23], showing its superior performance over both methods. This established DNAcopy+MergeLevels as the de facto standard in CNV

Figure 2: F-measures of CBS (light) and HaMMLET (dark) for calling aberrant copy numbers on simulated aCGH data (panels: BD&M, SRS, T; aberration sizes <5, 5-10, 10-20, >20). Boxes represent the interquartile range (IQR = Q3 - Q1), with a horizontal line showing the median (Q2), whiskers representing the range (3/2 IQR beyond Q1 and Q3), and the bullet representing the mean. HaMMLET has the same or better F-measures in most cases, and converges to 1 for larger segments, whereas CBS plateaus for aberrations greater than 10.

detection, despite the comparatively long running time.

The paper also includes aCGH simulations deemed to be reasonably realistic by the community. DNAcopy was used to segment 145 unpublished samples of breast cancer data, subsequently labeled as copy numbers 0 to 5 by sorting them into bins with boundaries (-inf, -0.4, -0.2, 0.2, 0.4, 0.6, inf), based on the sample mean in each segment (the last bin appears not to be used). Empirical length distributions were derived, from which the sizes of CN aberrations are drawn. The data itself is modeled to include Gaussian noise, which has been established as sufficient for aCGH data [67]. Means were generated so as to mimic random tumor cell proportions, and random variances were chosen to simulate experimenter bias often observed in real data; this emphasizes the importance of having automatic priors available when using Bayesian methods, as the means and variances might be unknown a priori. The simulations are available at http://www.cbs.dtu.dk/~hanni/aCGH. They comprise three sets of simulations: "breakpoint detection and merging" (BD&M), "spatial resolution study" (SRS), and "testing" (T) (see their paper for details). We used the MergeLevels implementation as provided on their website. It should be noted that the superiority of DNAcopy+MergeLevels was established using a simulation based upon segmentation results of DNAcopy itself.

We used the Bioconductor package DNAcopy (version 1.24.0) and followed the procedure suggested therein, including outlier smoothing. This version uses the linear-time variety of CBS [34]; note that other authors, such as [31], compare against its quadratic-time version [33], which is


likely to yield a favorable comparison and speedups of several orders of magnitude, especially on large data. For HaMMLET, we use a 5-state model with automatic hyperparameters P(σ² ≤ 0.01) = 0.9 (see section Automatic priors), and all Dirichlet hyperparameters set to 1.
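The automatic-prior interface, specifying a variance bound and the prior probability of staying below it, can be mimicked as follows. This is a hypothetical sketch assuming a conjugate inverse-gamma prior on σ² with fixed shape a = 1, for which the CDF has the closed form P(σ² ≤ v) = exp(-scale/v); the construction actually used in the paper's Automatic priors section may differ.

```python
import math

def auto_ig_scale(var, prob, shape=1.0):
    """Hypothetical helper: choose the scale of an inverse-gamma prior on
    the emission variance so that P(sigma^2 <= var) = prob. With shape
    a = 1, the CDF is exp(-scale/var), so scale = -var * log(prob)."""
    if shape != 1.0:
        raise NotImplementedError("closed form shown only for shape = 1")
    return -var * math.log(prob)

# P(sigma^2 <= 0.01) = 0.9, as used for the simulated aCGH runs:
scale = auto_ig_scale(0.01, 0.9)
assert abs(math.exp(-scale / 0.01) - 0.9) < 1e-12
```

The appeal of such a parameterization is that the user states an interpretable quantity (a plausible noise level and a confidence in it) instead of raw hyperparameters.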

We report F-measures for detecting aberrant segments (Fig. 2). On datasets T and BD&M, both methods have similar medians, but HaMMLET has a much better interquartile range (IQR) and range, about half of CBS's. On the spatial resolution data set (SRS), HaMMLET performs much better on very small aberrations. This might seem somewhat surprising, as short segments could easily get lost under compression. However, Lai et al. [65] have noted that smoothing-based methods such as quantile smoothing (quantreg) [19], lowess [20], and wavelet smoothing [25] perform particularly well in the presence of high noise and small CN aberrations, suggesting that "an optimal combination of the smoothing step and the segmentation step may result in improved performance". Our wavelet-based compression inherits those properties. For CNVs of sizes between 5 and 10, CBS and HaMMLET have similar ranges, with CBS being skewed towards better values. CBS has a slightly higher median for 10-20, with IQR and range being about the same. However, while HaMMLET's F-measure consistently approaches 1 for larger aberrations, CBS does not appear to improve significantly after size 10.

High-density CGH array

In this section, we demonstrate HaMMLET's performance on biological data. Due to the lack of a gold standard for high-resolution platforms, we assess the CNV calls qualitatively. We use raw aCGH data (GEO: GSE23949) [68] of genomic DNA from breast cancer cell line BT-474 (invasive ductal carcinoma, GEO: GSM590105) on an Agilent-021529 Human CGH Whole Genome Microarray 1x1M platform (GEO: GPL8736). We excluded gonosomes, mitochondrial and random chromosomes from the data, leaving 966,432 probes in total.

HaMMLET allows for using automatic emission priors (see section Automatic priors) by specifying a noise variance and a probability to sample a variance not exceeding this value. We compare HaMMLET's performance against CBS, using a 20-state model with automatic priors P(σ² ≤ 0.1) = 0.8, 10 prior self-transitions, and 1 for all other hyperparameters. CBS took over 2 h 9 m to process the entire array, whereas HaMMLET took 27.1 s for 100 iterations, a speedup of 288. The compression ratio (see section Speed and convergence effects of wavelet compression) was 220.3. CBS yielded a massive over-segmentation into 1,548 different copy number levels. As the data is derived from a relatively homogeneous cell line, as opposed to a biopsy, we do not expect the presence of subclonal populations to be a contributing factor [69, 70]. Instead, measurements on aCGH are known to be spatially correlated, resulting in a wave pattern which has to be removed in a preprocessing step. CBS performs such a smoothing, yet the unrealistically large number of different levels is likely due to residuals of said wave pattern. Repeated runs of CBS yielded different numbers of levels, suggesting that the merging was indeed incomplete. This further highlights the importance of prior knowledge in statistical inference. By limiting the number of CN states a priori, non-i.i.d. effects are easily remedied, and inference can be performed without preprocessing. Furthermore, the internal compression mechanism of HaMMLET is derived from a spatially adaptive regression method, so smoothing is inherent to our method. The results are included in the supplement.

For a more comprehensive analysis, we restricted our evaluation to chromosome 20 (21,687 probes), which we assessed to be the most complicated to infer, as it appears to have the highest number of different CN states and breakpoints. CBS yields a 19-state result after 157.8 s (Fig. 3, top). We have then used a 19-state model with automatic priors (P(σ² ≤ 0.04) = 0.9, 10 prior self-transitions, all other Dirichlet parameters set to 1) to reproduce this result. Using noise control (see Methods), our method took 1.61 s for 600 iterations. The solution we obtained is consistent with CBS (Fig. 3, middle and bottom). However, only 11 states were part of the final solution, i.e. the remaining states yielded no significant likelihood above that of other states. We observe superfluous states being ignored in our simulations as well, cf. Supplement. In light of the results on the entire array, we suggest that the segmentation by DNAcopy has not been sufficiently merged by MergeLevels. Most strikingly, HaMMLET does not show any marginal support for a segment called by CBS around probe number 4,500. We have confirmed that this is not due to data compression, as the segment is broken up into multiple blocks in each iteration (data not shown). On the other hand, two much smaller segments called by CBS in the 17,000-20,000 range do have marginal support of about 40% in HaMMLET, suggesting that the lack of support for the larger segment is correct. It should be noted that inference differs between the entire array and chromosome 20 in both methods, since long-range effects have higher impact in larger data.

We also demonstrate another feature of HaMMLET, called noise control. While Gaussian emissions have been deemed a sufficiently accurate noise model for aCGH [67], microarray data is prone to outliers, for example due to damages on the chip. While it is possible to model outliers directly [60], the characteristics of the wavelet transform allow us to largely suppress them during the construction of our data structure (see Methods). Notice that, due to noise control, most outliers are correctly labeled according to the segment they occur in, while the short gain segment close to the beginning is called correctly.


Figure 3: Copy number inference for chromosome 20 in invasive ductal carcinoma (21,687 probes). CBS creates a 19-state solution (top); however, a compressed 19-state HMM only supports an 11-state solution (bottom), suggesting insufficient level merging in CBS.


Speed and convergence effects of wavelet compression

The speedup gained by compression depends on how well the data can be compressed. Poor compression is expected when the means are not well separated, or when short segments have small variance, which necessitates the creation of smaller blocks for the rest of the data to expose potential low-variance segments to the sampler. On the other hand, data must not be over-compressed, to avoid merging small aberrations with normal segments, which would decrease the F-measure. The level of compression is measured by the compression ratio. Due to the dynamic changes to the block structure, we define the compression ratio as the product of the number of data points T and the number of iterations N, divided by the total number of blocks over all iterations. As usual, a compression ratio of one indicates no compression.
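This definition of the average compression ratio over a run is a direct computation; the sketch below is a plain transcription of it (variable names are ours).

```python
def compression_ratio(blocks_per_iteration, T):
    """Average compression ratio of a sampling run: (T * N) divided by the
    total number of blocks used across all N iterations. A ratio of 1 means
    every observation was its own block in every iteration (no compression)."""
    N = len(blocks_per_iteration)
    return (T * N) / sum(blocks_per_iteration)

# Example: 1000 observations, three iterations using 10, 20 and 10 blocks.
assert compression_ratio([10, 20, 10], T=1000) == 75.0
```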

To evaluate the impact of dynamic wavelet compression on the speed and convergence properties of an HMM, we created 129,600 different data sets with T = 32,768 probes each. In each data set, we randomly distributed 1 to 6 gains of a total length of {100, 250, 500, 750, 1000} uniformly among the data, and did the same for losses. Mean combinations (μ_loss, μ_neutral, μ_gain) were chosen from (log₂ ½, log₂ 1, log₂ 3/2), (-1, 0, 1), (-2, 0, 2) and (-10, 0, 10), and variances (σ²_loss, σ²_neutral, σ²_gain) from (0.05, 0.05, 0.05), (0.5, 0.1, 0.9), (0.3, 0.2, 0.1), (0.2, 0.1, 0.3), (0.1, 0.3, 0.2) and (0.1, 0.1, 0.1). These values have been selected to yield a wide range of easy and hard cases: both well separated, low-variance data with large aberrant segments, as well as cases in which small aberrations overlap significantly with the tail samples of high-variance neutral segments. Consequently, compression ratios range from ~1 to ~2,100. We use automatic priors P(σ² ≤ 0.2) = 0.9, self-transition priors α_ii ∈ {10, 100, 1000}, non-self transition priors α_ij = 1, and initial state priors α ∈ {1, 10}. Using all possible combinations of the above yields 129,600 different simulated data sets, a total of 4.2 billion values.
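A single configuration of this kind can be sketched as follows. This is illustrative only: segment placement and length handling are simplified relative to the procedure described above (e.g., we split the total aberrant length evenly and allow overlaps), and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T=32768, n_gains=3, n_losses=3, total_aberrant=500,
             means=(-1.0, 0.0, 1.0), variances=(0.3, 0.2, 0.1)):
    """Illustrative generator in the spirit of the benchmark: place gain
    and loss segments in a neutral background, then add Gaussian noise
    with state-specific variances."""
    mu_loss, mu_neutral, mu_gain = means
    v_loss, v_neutral, v_gain = variances
    labels = np.zeros(T, dtype=int)        # 0 = neutral, -1 = loss, +1 = gain
    seg_len = total_aberrant // (n_gains + n_losses)
    for state in [1] * n_gains + [-1] * n_losses:
        start = rng.integers(0, T - seg_len)
        labels[start:start + seg_len] = state
    mu = np.choose(labels + 1, [mu_loss, mu_neutral, mu_gain])
    sd = np.sqrt(np.choose(labels + 1, [v_loss, v_neutral, v_gain]))
    return labels, rng.normal(mu, sd)

labels, data = simulate()
```

Sweeping such a generator over the mean, variance and prior grids listed above is what produces the full set of 129,600 benchmark data sets.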

We achieve speedups per iteration of up to 350 compared to an uncompressed HMM (Fig. 5). In contrast, [62] have reported ratios of 10-60, with one instance of 90. Notice that the speedup is not linear in the compression ratio: while sampling itself is expected to yield linear speedup, the marginal counts still have to be tallied individually for each position, and dynamic block creation causes some overhead. The quantization artifacts observed for larger speedups are likely due to the limited resolution of the Linux time command (10 milliseconds). Compressed HaMMLET took about 11.2 CPU hours for all 129,600 simulations, whereas the uncompressed version took over 3 weeks and 5 days. All running times reported are CPU time, measured on a single core of an AMD Opteron 6174 processor clocked at 2.2 GHz.

Figure 5: HaMMLET's speedup as a function of the average compression during sampling. As expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact that some overhead is incurred by the dynamic compression, as well as by parts of the implementation that do not depend on the compression, such as tallying marginal counts.

We evaluate the convergence of the F-measure of compressed and uncompressed inference for each simulation. Since we are dealing with multi-class classification, we use the micro- and macro-averaged F-measures (Fmi, Fma) proposed by [71]; they tend to be dominated by the classifier's performance on common and rare categories, respectively. Since all state labels are sampled from the same prior, and hence their relative order is random, we used the label permutation which yielded the highest sum of micro- and macro-averaged F-measures.

In Fig. 4, we show that the compressed version of the Gibbs sampler converges almost instantly, whereas the uncompressed version converges much more slowly, with about 5% of the cases failing to yield an F-measure > 0.6 within 1000 iterations. Wavelet compression is likely to yield reasonably large blocks for the majority class early on, which leads to a strong posterior estimate of its parameters and self-transition probabilities. As expected, Fma values are generally worse, since any misclassification in a rare class has a larger impact. Especially in the uncompressed version, we observe that Fma tends to plateau until Fmi approaches 1.0. Since any misclassification in the majority (neutral) class adds false positives to the minority classes, this effect is expected. It implies that correct labeling of the majority class is a necessary condition for correct labeling of minority classes; in other words, correct identification of the rare, interesting segments requires the sampler to properly converge, which is much harder to achieve without compression. It should be noted that running compressed HaMMLET for 1000 iterations is unnecessary on the simulated data, as in all cases it converges within 25 to 50 iterations. Thus, for all practical purposes, an additional speedup of 40-80 can be achieved by reducing the number of iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG.
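The permutation-maximized micro- and macro-averaged F-measures can be computed as sketched below; this is a direct implementation of the stated procedure (for the small state spaces used here, enumerating all label permutations is feasible), with our own function names.

```python
from itertools import permutations

def f_measures(true, pred, n_states):
    """Micro- and macro-averaged F-measures over all states."""
    tp = [0] * n_states
    fp = [0] * n_states
    fn = [0] * n_states
    for t, p in zip(true, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f(tp_, fp_, fn_):
        return 2 * tp_ / (2 * tp_ + fp_ + fn_) if tp_ + fp_ + fn_ else 0.0

    f_mi = f(sum(tp), sum(fp), sum(fn))                    # micro: pool counts
    f_ma = sum(f(*c) for c in zip(tp, fp, fn)) / n_states  # macro: average per class
    return f_mi, f_ma

def best_permutation_f(true, pred, n_states):
    """Since sampled state labels are arbitrary, score the label permutation
    that maximizes the sum of micro- and macro-averaged F-measures."""
    return max((f_measures(true, [perm[p] for p in pred], n_states)
                for perm in permutations(range(n_states))),
               key=sum)
```

For example, a prediction that is perfect up to a swap of the two labels scores (1.0, 1.0) after permutation matching.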


Figure 4: The median value (black) and quantile ranges (in 5% steps) of the micro- (top) and macro-averaged (bottom) F-measures for uncompressed (left) and compressed (right) FBG inference on the same 129,600 simulated data sets, using automatic priors. The x-axis represents the number of iterations alone, and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges no later than 50 iterations (inset figures, right).


Coriell, ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples for the Coriell cell line. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated and typically not used in software evaluations. In Fig. 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample of a 54-year-old female. We used 6-state automatic priors with P(σ² ≤ 0.1) = 0.85, and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1000 iterations; running times for the other samples were similar. Our results for all 25 data sets have been included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use (both using the software and setting its method parameters), and often the speed of a method. Bayesian methods have obtained a reputation of requiring enormous computational effort and of being difficult to use, for the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of uncertainty in results. Nevertheless, it is not surprising that a computational effort a hundred times larger has prevented their wide-spread use.

Inferring copy number variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g. long segments lost on both chromosomal copies). With lesser quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate and automated methods is evident.

The core idea for our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method is Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation and allow summarizing the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running times, up to 350-fold compared to the standard FB-Gibbs sampler and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus, the combination of wavelets and HMM realizes a simultaneous improvement in accuracy and speed, where typically one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of the "easy"-to-call segments, given the current sample of HMM parameters, and identify the "hard"-to-call CN segments which need the full computational effort of FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM, and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements to accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig. 1. Compared to our prior work, we observe a speed-up of up to 3,000, due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics, and lastly a C++ instead of a Python implementation.

We performed an extensive comparison with the state of the art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and 129,600 simulated data sets, which were produced by a simulation process not based on an HMM, to make it a harder problem for our method. All comparisons demonstrated favorable results for our method when measuring accuracy, at a very noticeable acceleration compared to the state of the art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather with an effective allocation of computational effort.

We performed an extensive comparison with the state-of-the-art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and 129,600 simulated data sets, which were produced by a simulation process not based on an HMM to make it a harder problem for our method. All comparisons demonstrated favorable accuracy for our method, at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather with an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior and thus the method is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them for the evaluation, as they might be perceived as unfair to other methods.

In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate, and much


bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

Figure 6: HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.


more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for principled re-analysis of large cohorts, where the added accuracy and the much improved information about uncertainty in copy number calls from posterior marginal distributions will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y), with transition matrix A, a latent state sequence q = (q_0, q_1, ..., q_{T−1}), an observed emission sequence y = (y_0, y_1, ..., y_{T−1}), and the emission parameters θ. The initial distribution of q_0 is often included and denoted as π, but in a Bayesian setting it makes more sense to take π as a set of hyperparameters for q_0.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters

(A_ML, θ_ML) = argmax_{(A, θ)} L(A, θ | y)

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,

q* = argmax_q P(q | A_ML, θ_ML, y),

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentations [62]. Furthermore, alternative segmentations given those parameters are ignored, as are the ones for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly by integrating out the emission and transition variables:

P(q | y) = ∫_A ∫_θ P(q, A, θ | y) dθ dA.

Since this integral is intractable, it has to be approximated using Markov chain Monte Carlo techniques, i.e. by drawing N samples

(q^(i), A^(i), θ^(i)) ∼ P(q, A, θ | y)

and subsequently approximating marginal state probabilities by their frequency in the sample:

P(q_t = s | y) ≈ (1/N) Σ_{i=1}^{N} I(q_t^(i) = s).

Thus, for each position t, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

A ∼ P(A | q) ∝ P(q | A) P(A | π_A),
θ ∼ P(θ | q, y) ∝ P(q, y | θ) P(θ | π_θ), and
q ∼ P(q | A, y, θ) ∝ P(A, y, θ | q) P(q | π_q),

where π represents hyperparameters of the prior distributions. Notice that q depends on a prior only in q_0, so π_q is a set of parameters from which the distribution of q_0 is sampled (typically, π_q are the parameters of a Dirichlet distribution); in non-Bayesian settings, π often denotes P(q_0) itself. Typically, each prior will be conjugate, i.e. it will be in the same class of distributions as the posterior. Thus π_q and π_A(k), the hyperparameters of q_0 and of the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ_0) will be the parameters of a Normal-Inverse Gamma distribution.
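Concretely, with these conjugate priors the conditional draws reduce to standard distributions. The following sketch is our own illustration, not the HaMMLET implementation; all function names and the toy hyperparameter values are ours:

```python
import math
import random

rng = random.Random(42)

def sample_dirichlet(alphas):
    # Dirichlet draw via normalized Gamma variates.
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def sample_transition_row(counts, prior_alphas):
    # Posterior of one row of A: Dirichlet(prior alphas + transition counts).
    return sample_dirichlet([a + c for a, c in zip(prior_alphas, counts)])

def sample_emission_params(data, mu0, nu, alpha, beta):
    # Posterior of (mu, sigma^2) for one state under a
    # Normal-Inverse Gamma prior with parameters (mu0, nu, alpha, beta).
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)
    nu_n = nu + n
    mu_n = (nu * mu0 + n * mean) / nu_n
    alpha_n = alpha + n / 2.0
    beta_n = beta + 0.5 * ss + 0.5 * (nu * n / nu_n) * (mean - mu0) ** 2
    sigma2 = beta_n / rng.gammavariate(alpha_n, 1.0)  # Inverse-Gamma draw
    mu = rng.gauss(mu_n, math.sqrt(sigma2 / nu_n))
    return mu, sigma2

row = sample_transition_row(counts=[90, 10], prior_alphas=[1.0, 1.0])
mu, sigma2 = sample_emission_params([0.4, 0.6] * 50, mu0=0.0, nu=1.0,
                                    alpha=2.0, beta=0.1)
```

With many observed self-transitions the sampled row concentrates near the empirical transition frequencies, and the sampled emission mean near the state's data mean.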

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG, by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B = (B_w)_{w=1..W} be a partition of y into W blocks. Each block B_w contains n_w elements; let y_{w,k} be the k-th element in B_w.


The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transition into state j, and the n_w − 1 self-transitions, which yields

α_w(j) = A_jj^(n_w − 1) P(B_w | μ_j, σ_j²) Σ_i α_{w−1}(i) A_ij,  and

P(B_w | μ, σ²) = ∏_{k=1}^{n_w} P(y_{w,k} | μ, σ²).

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can easily be obtained.

Wavelet theory preliminaries

Here we review some essential wavelet theory; for details, see [74, 75]. Let

ψ(x) = 1 for 0 ≤ x < 1/2,  −1 for 1/2 ≤ x < 1,  0 elsewhere

be the Haar wavelet [76], and ψ_{j,k}(x) := 2^(j/2) ψ(2^j x − k); j and k are called the scale and shift parameter. Any square-integrable function over the unit interval, f ∈ L²([0,1)), can be approximated using the orthonormal basis {ψ_{j,k} | j, k ∈ Z, −1 ≤ j, 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x) which is spatially localized, i.e. non-zero on only a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data y = (y_0, ..., y_{T−1}) can be treated as an equidistant sample of f(x) by scaling the indices to the unit interval using x_t = t/T. Let h := log2 T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j ≤ h − 1):

y_t = Σ_{j,k} d_{j,k} ψ_{j,k}(x_t).

The wavelet transform d = W y is an orthogonal endomorphism and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [77].
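For intuition, the pyramid algorithm can be sketched in a few lines: each pass splits the current approximation into pairwise (normalized) sums and differences, so the total work is T + T/2 + T/4 + ... < 2T. This is our own illustrative reimplementation, not the HaMMLET code:

```python
import math

def haar_transform(y):
    # Orthonormal Haar wavelet transform via the pyramid algorithm, O(T).
    # The input length T must be a power of two. Returns the scaling
    # coefficient followed by detail coefficients, coarsest scale first.
    approx = list(y)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        # First pass produces the finest-scale details; reversed below.
        details.append([(e - o) / math.sqrt(2) for e, o in zip(even, odd)])
        approx = [(e + o) / math.sqrt(2) for e, o in zip(even, odd)]
    out = approx  # the single remaining scaling coefficient
    for level in reversed(details):
        out = out + level
    return out

y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
d = haar_transform(y)
```

Since the transform is orthogonal, the energy of y is preserved in d; for this piecewise constant signal, exactly one detail coefficient is non-zero, illustrating why the Haar basis represents segmented data so sparsely.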

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̃ be a vector obtained by setting elements of d to zero; then ỹ = W^⊤ d̃ is called selective wavelet reconstruction (SW). If all coefficients d_{j,k} with ψ_{j,k}(x_t) ≠ ψ_{j,k}(x_{t+1}) are set to zero for some t, then ỹ_t = ỹ_{t+1}, which implies a block structure on ỹ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that ỹ does not have to be computed explicitly; the block boundaries can be inferred from the positions of zero-entries in d̃ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem of finding an estimate μ̂ of a piecewise constant function μ whose range is precisely the set of emission means, i.e.

μ = f(x),  y = f(x) + ε,  ε ∼iid N(0, σ²).

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ŷ ≠ μ. However, if ŷ is piecewise constant and minimizes ‖μ − ŷ‖, the sample means of each block are close to the true emission means. This yields high likelihoods for their corresponding states, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints; ŷ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle ∆(μ, y) that knew the true μ, then there exists a method M_SW(y, ∆), using an optimal subset of wavelet coefficients provided by ∆, such that the quadratic risk of ŷ_SW = W^⊤_SW d is bounded as

‖μ − ŷ_SW‖²₂ = O(σ² ln T / T).

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M_WCT(y, λ, σ), called wavelet coefficient thresholding, which sets to zero those coefficients whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T), with ŷ_WCT = W^⊤_WCT d, such that

‖μ − ŷ_WCT‖²₂ ≤ (2 ln T + 1) ( ‖μ − ŷ_SW‖²₂ + σ²/T ).

This λ_T is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough


to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ỹ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides samples of them in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

 1: procedure HAMMLET(y, π_A, π_θ, π_q)
 2:     T ← |y|
 3:     λ ← √(2 ln T)
 4:     A ∼ P(A | π_A)
 5:     θ ∼ P(θ | π_θ)
 6:     for i = 1, ..., N do
 7:         σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:         Create block sequence B from threshold λσ_min
 9:         q ∼ P(q | A, B, θ) ∝ P(A, B, θ | q) P(q | π_q)
10:         A ∼ P(A | q) ∝ P(q | A) P(A | π_A)
11:         θ ∼ P(θ | q, B) ∝ P(q, B | θ) P(θ | π_θ)
12:         Add count of marginal states for q to result
13:     end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances were too large: blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

σ²_MAD = ( MAD_k( d_{log2 T − 1, k} ) / Φ⁻¹(3/4) )²,

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i ∈ θ}. If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
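A minimal sketch of this estimator (our own illustration): the finest-scale Haar details are pairwise differences scaled by 1/√2, so for i.i.d. Gaussian noise on a piecewise constant signal they are (almost everywhere) N(0, σ²) regardless of the signal level, and their median absolute deviation divided by Φ⁻¹(3/4) ≈ 0.6745 estimates σ robustly:

```python
import math
import random
import statistics

PHI_INV_3_4 = 0.6744897501960817  # Phi^{-1}(3/4), standard normal quartile

def sigma_mad(y):
    # Robust noise estimate from the finest-scale Haar detail
    # coefficients d_{h-1,k} = (y_{2k} - y_{2k+1}) / sqrt(2).
    d = [(y[2 * k] - y[2 * k + 1]) / math.sqrt(2) for k in range(len(y) // 2)]
    med = statistics.median(d)
    mad = statistics.median(abs(x - med) for x in d)
    return mad / PHI_INV_3_4

# Piecewise constant signal plus unit-variance Gaussian noise:
rng = random.Random(0)
signal = [0.0] * 256 + [3.0] * 256
y = [s + rng.gauss(0.0, 1.0) for s in signal]
sigma_hat = sigma_mad(y)  # close to the true value 1.0
```

The jump at t = 256 falls between two pairs here, so it does not contaminate the estimate; in general, the few straddling coefficients are exactly what the median is robust against.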

A data structure for dynamic compression

The necessity to recreate a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log2 T and ℓ = h − j. We can map the wavelet ψ_{j,k} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ is increasing from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{j,k} = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{j,k}, with ℓ = h − j and t = k2^ℓ (ψ_{−1,0} is simply constant on the [0,1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size: N_{ℓ,t}[n] = 2^ℓ.

• The sum of data points stored in the subtree leaves: N_{ℓ,t}[Σ1] = Σ_{i ∈ I_{j,k}} y_i.

• Similarly, the sum of squares: N_{ℓ,t}[Σ2] = Σ_{i ∈ I_{j,k}} y_i².

• The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself: N_{0,t}[d] = 0, and N_{ℓ>0,t}[d] = max over ℓ′ ≤ ℓ, t ≤ t′ < t + 2^ℓ of |d_{h−ℓ′, 2^{−ℓ′} t′}|.

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients


on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
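The data structure and the pruning traversal can be sketched as follows (our simplified reconstruction for illustration; it builds the DFS array recursively, whereas a production implementation can do so without recursion):

```python
import math

def build_dfs_tree(y):
    # Build the wavelet tree as a flat array in DFS (pre-order) layout.
    # Each node stores (n, sum, sum_of_squares, max_abs_coefficient) of
    # its subtree; a subtree with n leaves occupies 2n - 1 array slots.
    nodes = []
    def rec(seg):
        n = len(seg)
        idx = len(nodes)
        nodes.append(None)  # reserve this slot, fill after recursion
        if n == 1:
            nodes[idx] = (1, seg[0], seg[0] ** 2, 0.0)
            return 0.0
        half = n // 2
        # Haar coefficient of the wavelet covering this segment.
        d = (sum(seg[:half]) - sum(seg[half:])) / math.sqrt(n)
        m = max(abs(d), rec(seg[:half]), rec(seg[half:]))
        nodes[idx] = (n, sum(seg), sum(v * v for v in seg), m)
        return m
    rec(list(y))
    return nodes

def create_blocks(nodes, threshold):
    # DFS traversal with pruning: if the subtree maximum is <= threshold,
    # emit one block and jump over the whole subtree (i + 2n - 1).
    blocks, i = [], 0
    while i < len(nodes):
        n, s1, s2, m = nodes[i]
        if m <= threshold:
            blocks.append((n, s1, s2))
            i += 2 * n - 1
        else:
            i += 1
    return blocks

nodes = build_dfs_tree([1.0] * 4 + [5.0] * 4)
blocks = create_blocks(nodes, threshold=1.0)
```

On this toy input only the root exceeds the threshold, so the traversal inspects three nodes and emits two blocks of four leaves each, carrying the sufficient statistics (n, Σ1, Σ2) needed by the forward recursion.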

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.

The wavelet tree data structure can readily be extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees exceeds the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ∼ NIΓ(μ_0, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:

σ² ∼ IΓ(α, β),  μ ∼ N(μ_0, σ²/ν).

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and p the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain

p = P(σ² ≤ s²) = Γ(α, β/s²) / Γ(α) = Q(α, β/s²).

IΓ has a mean for α > 1, and closed-form solutions for α ∈ N. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

β = −s² ( W₋₁(−p/e) + 1 ),  0 < p ≤ 1,

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation with a maximum error of 0.025% is given in [87], which yields

β ≈ s² ( b + (2/M1) [ 1 − ( 1 + (M1 √(b/2)) / (1 + M2 b exp(M3 √b)) )⁻¹ ] ),

b = −ln p,
M1 = 0.3361,  M2 = −0.0042,  M3 = −0.0201.
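A sketch of this computation (our own illustration; the closed-form check uses Q(2, x) = (1 + x) e^{−x}, the upper regularized incomplete gamma function for α = 2). The values p = 0.85 and s² = 0.1 mirror the setting used for the T47D sample:

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201

def lambert_w_neg1_of(p):
    # Analytical approximation of W_{-1}(-p/e) for 0 < p <= 1,
    # following the Barry et al. style approximation with the
    # constants M1, M2, M3 above; here sigma = -ln(p) = b.
    b = -math.log(p)
    inner = 1.0 + (M1 * math.sqrt(b / 2.0)) / (
        1.0 + M2 * b * math.exp(M3 * math.sqrt(b)))
    return -1.0 - b - (2.0 / M1) * (1.0 - 1.0 / inner)

def beta_from(p, s2):
    # Solve p = Q(2, beta / s^2) for beta, i.e. choose beta so that
    # P(sigma^2 <= s^2) = p under IG(alpha = 2, beta).
    return -s2 * (lambert_w_neg1_of(p) + 1.0)

beta = beta_from(p=0.85, s2=0.1)

# Verify by plugging back into the CDF: Q(2, x) = (1 + x) * exp(-x).
x = beta / 0.1
p_check = (1.0 + x) * math.exp(-x)
```

Plugging the approximate β back into the CDF recovers the requested probability to within a fraction of a percent, confirming that the closed-form approximation is accurate enough for a weak prior.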

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that μ contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ_0 can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

μ_0 = Σ1 / n  and  ν = β ( max{ (n Σ2 − Σ1²) / n², δ } )⁻¹.
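Under our reading of the formulas above (the expression for ν is reconstructed from a garbled source, so treat it as an assumption), the remaining hyperparameters follow directly from the root node's sufficient statistics; β is taken as given here, with a fixed toy value:

```python
def automatic_prior(n, s1, s2, delta, beta):
    # mu_0 is the global data mean. nu is chosen so that the expected
    # variance of mu, beta / nu, matches the larger of the data variance
    # and the between-block-means variance delta.
    mu0 = s1 / n
    data_var = (n * s2 - s1 * s1) / (n * n)
    nu = beta / max(data_var, delta)
    return mu0, nu

# Toy numbers (ours, for illustration): 1000 points with mean 0.2 and
# variance 0.5; block-mean variance 0.8; beta = 0.07.
n, mean, var = 1000, 0.2, 0.5
s1 = n * mean
s2 = n * (var + mean ** 2)  # since var = s2/n - mean^2
mu0, nu = automatic_prior(n, s1, s2, delta=0.8, beta=0.07)
```

Here δ > data variance, so ν = β/δ = 0.0875; with balanced classes the data variance would dominate instead.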

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

( A_jj^(n_w − 1) / ( (2π)^(n_w/2) σ_j^(n_w) ) ) exp( − Σ_{k=1}^{n_w} (y_{w,k} − μ_j)² / (2σ_j²) ).

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

exp( − Σ_{k=1}^{n_w} (y_{w,k} − μ_j)² / (2σ_j²) + (n_w − 1) log A_jj − n_w log σ_j ).

Using the block's sufficient statistics

n_w,  Σ1 = Σ_{k=1}^{n_w} y_{w,k},  Σ2 = Σ_{k=1}^{n_w} y²_{w,k},

the exponent can be rewritten as

E_w(j) = (2 μ_j Σ1 − Σ2) / (2σ_j²) + K(n_w, j),
K(n_w, j) = (n_w − 1) log A_jj − n_w ( log σ_j + μ_j² / (2σ_j²) ).

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

exp( max_k E_w(k) ),

which cancels out during normalization. Hence we obtain

α_w(j) = exp( E_w(j) − max_k E_w(k) ) Σ_i ᾱ_{w−1}(i) A_ij,

which are then normalized to

ᾱ_w(j) = α_w(j) / Σ_k α_w(k).

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W., Lee, C.: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk, L., Marshall, C.R., Wintle, R.F., Scherer, S.W.: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No 1, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll, S.A., Altshuler, D.M.: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain, L.V., Armour, J.A.L., Tobin, M.D.: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann, J.S., Sharp, A.J., Antonarakis, S.E.: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook, E.H., Scherer, S.W.: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic, A., Tomljanovic, D.: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas, A.K., Corvin, A.P., Gallagher, L.: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho, E.K., Tchinda, J., Freeman, J.L., Chung, Y.-J., Cai, W.W., Lee, C.: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien, A., Malkin, D.: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk, L., Carson, A.R., Scherer, S.W.: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann, J.S., Estivill, X., Antonarakis, S.E.: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp, A.J.: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis, C., Coe, B.P., Zhang, L., Rosin, M.P., Lam, W.L.: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson, D.G., Ylstra, B., Segraves, R., Collins, C., Dairkee, S.H., Kowbel, D., Kuo, W.L., Gray, J.W., Pinkel, D.: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman, J.A., Fridlyand, J., Pejavar, S., Olshen, A.B., Korkola, J.E., DeVries, S., Carroll, P., Kuo, W.-L., Pinkel, D., Albertson, D.G., Cordon-Cardo, C., Jain, A.N., Waldman, F.M.: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio, R., Hautaniemi, S., Kauraniemi, P., Yli-Harja, O., Astola, J., Wolf, M., Kallioniemi, A.: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack, J.R., Sørlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lonning, P.E., Tibshirani, R., Botstein, D., Børresen-Dale, A.-L., Brown, P.O.: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers, P.H.C., de Menezes, R.X.: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148

bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

[20] Cleveland WS: Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA Copy Number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A New Smoothing Model for Analyzing Array CGH Data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9. http://link.springer.com/10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=101128613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104. http://epubs.siam.org/doi/book/10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(R). Transactions of the American Mathematical Society 315(1) (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5


Figure 1: Overview of HaMMLET. Instead of individual computations per observation (panel a), Forward-Backward Gibbs sampling is performed on a compressed version of the data, using sufficient statistics for block-wise computations (panel b) to accelerate inference in Bayesian Hidden Markov Models. During the sampling (panel c), parameters and copy number sequences are sampled iteratively. During each iteration, the sampled variances determine which coefficients of the data's Haar wavelet transform are dynamically set to zero. This controls potential break points at finer or coarser resolution or, equivalently, defines blocks of variable number and size (panel c, bottom). Our approach thus yields a dynamic, adaptive compression scheme which greatly improves speed of convergence, accuracy and running times.


CN states. The clustering heuristic relies on empirically derived parameters not supported by a theoretical analysis, which can lead to suboptimal clustering or overfitting. Also, the method cannot easily be generalized for multivariate data. Lastly, the implementation was primarily aimed at comparative analysis between the frequentist and Bayesian approach, as opposed to overall speed.

To address these issues, and to make Bayesian CNV inference feasible even on a laptop, we propose the combination of HMMs with another popular signal processing technology. Haar wavelets have previously been used in CNV detection [63], mostly as a preprocessing tool for statistical downstream applications [24–28], or simply as a visual aid in GUI applications [17, 64]. A combination of smoothing and segmentation has been suggested as likely to improve results [65], and here we show that this is indeed the case. The wavelets provide a theoretical foundation for a better dynamic compression scheme for faster convergence and accelerated Bayesian sampling. We improve simultaneously upon the typically conflicting goals of accuracy and speed, because the wavelets allow summary treatment of "easy" CN calls in segments and focus computational effort on the "difficult" CN calls, dynamically and adaptively. This is in contrast to other computationally efficient tools, which often simplify the statistical model or use heuristics. The required data structure can be efficiently computed, incurs minimal overhead, and has a straightforward generalization to multivariate data. We further show how the wavelet transform yields a natural way to set hyperparameters automatically, with little input from the user.
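To illustrate the underlying principle, the following is a minimal textbook Haar wavelet shrinkage, not HaMMLET's dynamic data structure: zeroing small detail coefficients yields a piecewise-constant reconstruction, and runs of equal values act as blocks that can share a single copy-number call.

```python
import numpy as np

def haar_denoise(y, threshold):
    """Hard-threshold Haar wavelet shrinkage (sketch; assumes len(y) is a
    power of two). Detail coefficients with absolute value below the
    threshold are zeroed, so the reconstruction is piecewise constant."""
    coeffs = []                       # detail coefficients, finest level first
    approx = y.astype(float)
    while len(approx) > 1:
        pairs = approx.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    # hard thresholding: keep only large detail coefficients
    coeffs = [np.where(np.abs(d) > threshold, d, 0.0) for d in coeffs]
    # inverse transform, coarsest level first
    for detail in reversed(coeffs):
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + detail) / np.sqrt(2.0)
        out[1::2] = (approx - detail) / np.sqrt(2.0)
        approx = out
    return approx

# two noiseless segments: the jump survives, the flat parts stay flat
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
print(haar_denoise(y, threshold=0.5))   # -> [0. 0. 0. 0. 1. 1. 1. 1.]
```

Raising the threshold merges more positions into fewer blocks; HaMMLET instead ties the effective threshold to the currently sampled variances, so the block structure changes adaptively between iterations.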

We implemented our method in a highly optimized end-user software called HaMMLET. Aside from achieving an acceleration of up to two orders of magnitude, it exhibits significantly improved convergence behavior, has excellent precision and recall, and provides Bayesian inference within seconds, even for large data sets. The accuracy and speed of HaMMLET also make it an excellent choice for routine diagnostic use and large-scale re-analysis of legacy data.

Results and discussion

Simulated aCGH data

A previous survey [65] of eleven CNV calling methods for aCGH has established that segmentation-focused methods such as DNAcopy [32, 33], an implementation of circular binary segmentation (CBS), as well as CGHseg [35], perform consistently well. DNAcopy performs a number of t-tests to detect break-point candidates. The result is typically over-segmented and requires a merging step in post-processing, especially to reduce the number of segment means. To this end, MergeLevels was introduced by [66]. They compare the combination DNAcopy+MergeLevels to their own HMM implementation [45] as well as GLAD [23], showing its superior performance over both methods. This established DNAcopy+MergeLevels as the de facto standard in CNV

Figure 2: F-measures of CBS (light) and HaMMLET (dark) for calling aberrant copy numbers on simulated aCGH data (panels: BD&M, SRS, T; aberration sizes <5, 5–10, 10–20, >20). Boxes represent the interquartile range (IQR = Q3 − Q1), with a horizontal line showing the median (Q2), whiskers representing the range (3/2 IQR beyond Q1 and Q3), and the bullet representing the mean. HaMMLET has the same or better F-measures in most cases and converges to 1 for larger segments, whereas CBS plateaus for aberrations greater than 10.

detection, despite the comparatively long running time. The paper also includes aCGH simulations deemed to be reasonably realistic by the community. DNAcopy was used to segment 145 unpublished samples of breast cancer data, which were subsequently labeled as copy numbers 0 to 5 by sorting them into bins with boundaries (−∞, −0.4, −0.2, 0.2, 0.4, 0.6, ∞), based on the sample mean in each segment (the last bin appears to not be used). Empirical length distributions were derived, from which the sizes of CN aberrations are drawn. The data itself is modeled to include Gaussian noise, which has been established as sufficient for aCGH data [67]. Means were generated so as to mimic random tumor cell proportions, and random variances were chosen to simulate experimenter bias often observed in real data; this emphasizes the importance of having automatic priors available when using Bayesian methods, as the means and variances might be unknown a priori. The simulations are available at http://www.cbs.dtu.dk/~hanni/aCGH. They comprise three sets of simulations: "breakpoint detection and merging" (BD&M), "spatial resolution study" (SRS), and "testing" (T) (see their paper for details). We used the MergeLevels implementation as provided on their website. It should be noted that the superiority of DNAcopy+MergeLevels was established using a simulation based upon segmentation results of DNAcopy itself.
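The labeling rule described above, assigning a copy number 0–5 to each segment from its mean log-ratio, can be sketched as follows (a hypothetical helper, not the survey's original code):

```python
import bisect

# Interior bin edges of (-inf, -0.4, -0.2, 0.2, 0.4, 0.6, inf), as used in
# the simulation to label DNAcopy segment means as copy numbers 0-5.
BOUNDARIES = [-0.4, -0.2, 0.2, 0.4, 0.6]

def copy_number(segment_mean):
    """Map a segment's mean log-ratio to a copy-number label in 0..5."""
    return bisect.bisect_right(BOUNDARIES, segment_mean)

print([copy_number(m) for m in (-0.5, -0.3, 0.0, 0.3, 0.5, 0.7)])
# -> [0, 1, 2, 3, 4, 5]
```

As the text notes, the highest bin (copy number 5, means above 0.6) appears not to occur in that data.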

We used the Bioconductor package DNAcopy (version 1.24.0) and followed the procedure suggested therein, including outlier smoothing. This version uses the linear-time variety of CBS [34]; note that other authors, such as [31], compare against its quadratic-time version [33], which is


likely to yield a favorable comparison and speedups of several orders of magnitude, especially on large data. For HaMMLET, we use a 5-state model with automatic hyperparameters P(σ² ≤ 0.01) = 0.9 (see section Automatic priors) and all Dirichlet hyperparameters set to 1.

We report F-measures for detecting aberrant segments (Fig. 2). On datasets T and BD&M, both methods have similar medians, but HaMMLET has a much better interquartile range (IQR) and range, about half of CBS's. On the spatial resolution data set (SRS), HaMMLET performs much better on very small aberrations. This might seem somewhat surprising, as short segments could easily get lost under compression. However, Lai et al. [65] have noted that smoothing-based methods such as quantile smoothing (quantreg) [19], lowess [20] and wavelet smoothing [25] perform particularly well in the presence of high noise and small CN aberrations, suggesting that "an optimal combination of the smoothing step and the segmentation step may result in improved performance". Our wavelet-based compression inherits those properties. For CNVs of sizes between 5 and 10, CBS and HaMMLET have similar ranges, with CBS being skewed towards better values. CBS has a slightly higher median for 10–20, with IQR and range being about the same. However, while HaMMLET's F-measure consistently approaches 1 for larger aberrations, CBS does not appear to significantly improve after size 10.

High-density CGH array

In this section, we demonstrate HaMMLET's performance on biological data. Due to the lack of a gold standard for high-resolution platforms, we assess the CNV calls qualitatively. We use raw aCGH data (GEO: GSE23949) [68] of genomic DNA from breast cancer cell line BT-474 (invasive ductal carcinoma, GEO: GSM590105) on an Agilent-021529 Human CGH Whole Genome Microarray 1x1M platform (GEO: GPL8736). We excluded gonosomes, mitochondrial and random chromosomes from the data, leaving 966,432 probes in total.

HaMMLET allows for using automatic emission priors (see section Automatic priors) by specifying a noise variance and a probability to sample a variance not exceeding this value. We compare HaMMLET's performance against CBS using a 20-state model with automatic priors P(σ² ≤ 0.1) = 0.8, 10 prior self-transitions, and 1 for all other hyperparameters. CBS took over 2 h 9 m to process the entire array, whereas HaMMLET took 27.1 s for 100 iterations, a speedup of 288. The compression ratio (see section Speed and convergence effects of wavelet compression) was 220.3. CBS yielded a massive over-segmentation into 1,548 different copy number levels. As the data is derived from a relatively homogeneous cell line, as opposed to a biopsy, we do not expect the presence of subclonal populations to be a contributing factor [69, 70]. Instead, measurements on aCGH are known to be spatially correlated, resulting in a wave pattern which has to be removed in a preprocessing step. CBS performs such a smoothing, yet the unrealistically large number of different levels is likely due to residuals of said wave pattern. Repeated runs of CBS yielded different numbers of levels, suggesting that indeed the merging was incomplete. This further highlights the importance of prior knowledge in statistical inference. By limiting the number of CN states a priori, non-i.i.d. effects are easily remedied, and inference can be performed without preprocessing. Furthermore, the internal compression mechanism of HaMMLET is derived from a spatially adaptive regression method, so smoothing is inherent to our method. The results are included in the supplement.

For a more comprehensive analysis, we restricted our evaluation to chromosome 20 (21,687 probes), which we assessed to be the most complicated to infer, as it appears to have the highest number of different CN states and breakpoints. CBS yields a 19-state result after 157.8 s (Fig. 3, top). We have then used a 19-state model with automated priors (P(σ² ≤ 0.04) = 0.9, 10 prior self-transitions, all other Dirichlet parameters set to 1) to reproduce this result. Using noise control (see Methods), our method took 16.1 s for 600 iterations. The solution we obtained is consistent with CBS (Fig. 3, middle and bottom). However, only 11 states were part of the final solution, i.e. 8 states yielded no significant likelihood above that of other states. We observe superfluous states being ignored in our simulations as well, cf. Supplement. In light of the results on the entire array, we suggest that the segmentation by DNAcopy has not sufficiently been merged by MergeLevels. Most strikingly, HaMMLET does not show any marginal support for a segment called by CBS around probe number 4500. We have confirmed that this is not due to data compression, as the segment is broken up into multiple blocks in each iteration (data not shown). On the other hand, two much smaller segments called by CBS in the 17,000–20,000 range do have marginal support of about 40% in HaMMLET, suggesting that the lack of support for the larger segment is correct. It should be noted that inference differs between the entire array and chromosome 20 in both methods, since long-range effects have higher impact in larger data.

We also demonstrate another feature of HaMMLET called noise control. While Gaussian emissions have been deemed a sufficiently accurate noise model for aCGH [67], microarray data is prone to outliers, for example due to damages on the chip. While it is possible to model outliers directly [60], the characteristics of the wavelet transform allow us to largely suppress them during the construction of our data structure (see Methods). Notice that, due to noise control, most outliers are correctly labeled according to the segment they occur in, while the short gain segment close to the beginning is called correctly.


Figure 3: Copy number inference for chromosome 20 in invasive ductal carcinoma (21,687 probes). CBS creates a 19-state solution (top); however, a compressed 19-state HMM only supports an 11-state solution (bottom), suggesting insufficient level merging in CBS.


Speed and convergence effects of wavelet compression

The speedup gained by compression depends on how well the data can be compressed. Poor compression is expected when the means are not well separated, or when short segments have small variance, which necessitates the creation of smaller blocks for the rest of the data to expose potential low-variance segments to the sampler. On the other hand, data must not be over-compressed, to avoid merging small aberrations with normal segments, which would decrease the F-measure. The level of compression is measured by the compression ratio. Due to the dynamic changes to the block structure, we define the compression ratio as the product of the number of data points T and the number of iterations N, divided by the total number of blocks in all iterations. As usual, a compression ratio of one indicates no compression.
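As a sketch (hypothetical helper names, not HaMMLET's internal bookkeeping), the compression ratio over a sampling run can be computed so that a value of one means no compression and larger values mean stronger compression:

```python
def compression_ratio(block_counts, T):
    """Compression ratio over a sampling run: data points times iterations,
    divided by the total number of blocks created over all iterations.
    block_counts[i] is the number of blocks used in iteration i."""
    N = len(block_counts)
    return (T * N) / sum(block_counts)

# T = 1000 probes; three iterations using 10, 20 and 20 blocks
print(compression_ratio([10, 20, 20], T=1000))   # -> 60.0
# one block per data point in every iteration: no compression
print(compression_ratio([1000] * 5, T=1000))     # -> 1.0
```

Because the block structure is resampled in every iteration, the per-iteration block counts vary, which is why the ratio is defined over the whole run rather than a single iteration.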

To evaluate the impact of dynamic wavelet compression on the speed and convergence properties of an HMM, we created 129,600 different data sets with T = 32,768 probes each. In each data set, we randomly distributed 1 to 6 gains of a total length of 100, 250, 500, 750 or 1,000 uniformly among the data, and did the same for losses. Mean combinations (μ_loss, μ_neutral, μ_gain) were chosen from (log₂ 1/2, log₂ 1, log₂ 3/2), (−1, 0, 1), (−2, 0, 2) and (−10, 0, 10), and variances (σ²_loss, σ²_neutral, σ²_gain) from (0.05, 0.05, 0.05), (0.5, 0.1, 0.9), (0.3, 0.2, 0.1), (0.2, 0.1, 0.3), (0.1, 0.3, 0.2) and (0.1, 0.1, 0.1). These values have been selected to yield a wide range of easy and hard cases: both well-separated, low-variance data with large aberrant segments, as well as cases in which small aberrations overlap significantly with the tail samples of high-variance neutral segments. Consequently, compression ratios range from ∼1 to ∼2,100. We use automatic priors P(σ² ≤ 0.2) = 0.9, self-transition priors α_ii ∈ {10, 100, 1000}, non-self transition priors α_ij = 1, and initial state priors α ∈ {1, 10}. Using all possible combinations of the above yields 129,600 different simulated data sets, a total of 4.2 billion values.

We achieve speedups per iteration of up to 350 compared to an uncompressed HMM (Fig. 5). In contrast, [62] have reported ratios of 10–60, with one instance of 90. Notice that the speedup is not linear in the compression ratio. While sampling itself is expected to yield linear speedup, the marginal counts still have to be tallied individually for each position, and dynamic block creation causes some overhead. The quantization artifacts observed for larger speedups are likely due to the limited resolution of the Linux time command (10 milliseconds). Compressed HaMMLET took about 11.2 CPU hours for all 129,600 simulations, whereas the uncompressed version took over 3 weeks and 5 days. All running times reported are CPU time, measured on a single core of an AMD Opteron™ 6174 processor clocked at 2.2 GHz.

Figure 5: HaMMLET's speedup as a function of the average compression during sampling. As expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact that some overhead is incurred by the dynamic compression, as well as by parts of the implementation that do not depend on the compression, such as tallying marginal counts.

We evaluate the convergence of the F-measure of compressed and uncompressed inference for each simulation. Since we are dealing with multi-class classification, we use the micro- and macro-averaged F-measures (F_mi, F_ma) proposed by [71]; they tend to be dominated by the classifier's performance on common and rare categories, respectively. Since all state labels are sampled from the same prior, and hence their relative order is random, we used the label permutation which yielded the highest sum of micro- and macro-averaged F-measures.

In Fig. 4, we show that the compressed version of the Gibbs sampler converges almost instantly, whereas the uncompressed version converges much more slowly, with about 5% of the cases failing to yield an F-measure > 0.6 within 1,000 iterations. Wavelet compression is likely to yield reasonably large blocks for the majority class early on, which leads to a strong posterior estimate of its parameters and self-transition probabilities. As expected, F_ma values are generally worse, since any misclassification in a rare class has a larger impact. Especially in the uncompressed version, we observe that F_ma tends to plateau until F_mi approaches 1.0. Since any misclassification in the majority (neutral) class adds false positives to the minority classes, this effect is expected: it implies that correct labeling of the majority class is a necessary condition for correct labeling of minority classes. In other words, correct identification of the rare, interesting segments requires the sampler to properly converge, which is much harder to achieve without compression. It should be noted that running compressed HaMMLET for 1,000 iterations is unnecessary on the simulated data, as in all cases it converges within 25 to 50 iterations. Thus, for all practical purposes, an additional speedup of 40–80 can be achieved by reducing the number of iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG.


bioRxiv preprint, doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.


Figure 4: The median value (black) and quantile ranges (in 5% steps) of the micro- (top) and macro-averaged (bottom) F-measures for uncompressed (left) and compressed (right) FBG inference on the same 129,600 simulated data sets, using automatic priors. The x-axis represents the number of iterations alone and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges no later than 50 iterations (inset figures, right).


Coriell, ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples for the Coriell cell line. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated, and typically not used in software evaluations. In Fig. 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample of a 54-year-old female. We used 6-state automatic priors with P(σ² ≤ 0.1) = 0.85 and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1,000 iterations; running times for the other samples were similar. Our results for all 25 data sets are included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use (both of the software itself and of setting software and method parameters), and, often, the speed of a method. Bayesian methods have obtained a reputation of requiring enormous computational effort, and of being difficult to use because of the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results, and providing, in the form of posterior distributions, an accurate measure of uncertainty in results. Nevertheless, it is not surprising that a hundred-fold increase in computational effort alone has prevented their wide-spread use.

Inferring Copy Number Variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g. long segments lost on both chromosomal copies). With lower-quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate and automated methods is evident.

The core idea for our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method are Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation, and allow to summarize the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM, by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running time, up to 350-fold compared to the standard FB-Gibbs sampler, and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus the combination of wavelets and HMM realizes a simultaneous improvement in accuracy and speed; typically, one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of those "easy" to call segments given the current sample of HMM parameters, and identify the "hard" to call CN segments which need the full computational effort from FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM, and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements in accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig. 1. Compared to our prior work, we observe a speed-up of up to 3,000, due to the greatly improved convergence, O(T) instead of O(T log T) clustering, improved numerics, and, lastly, a C++ instead of a Python implementation.

We performed an extensive comparison with the state-of-the-art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and on 129,600 simulated data sets; the latter were produced by a simulation process not based on an HMM, to make the problem harder for our method. All comparisons demonstrated favorable accuracy for our method, at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather through an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior, and thus the method, is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them in the evaluation, as they might be perceived as unfair to other methods.

In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate, and much more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for the principled re-analysis of large cohorts, where the added accuracy, and the much improved information about uncertainty in copy number calls from posterior marginal distributions, will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Figure 6: HaMMLET's inference of copy-number segments on a T47D breast ductal carcinoma sample. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models, and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y), with transition matrix A, a latent state sequence q = (q_0, q_1, ..., q_{T−1}), an observed emission sequence y = (y_0, y_1, ..., y_{T−1}), and the emission parameters θ. The initial distribution of q_0 is often included and denoted as π, but in a Bayesian setting it makes more sense to take π as a set of hyperparameters for q_0.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters,

\[
(\mathcal{A}^{ML}, \theta^{ML}) = \arg\max_{(\mathcal{A}, \theta)} \mathcal{L}(\mathcal{A}, \theta \,|\, y),
\]

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,

\[
q^{*} = \arg\max_{q} P(q \,|\, \mathcal{A}^{ML}, \theta^{ML}, y),
\]

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, if there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentations [62]. Furthermore, alternative segmentations given those parameters are ignored, as are the ones for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly, by integrating out the emission and transition variables:

\[
P(q \,|\, y) = \int_{\mathcal{A}} \int_{\theta} P(q, \mathcal{A}, \theta \,|\, y) \, d\theta \, d\mathcal{A}.
\]

Since this integral is intractable, it has to be approximated using Markov Chain Monte Carlo techniques, i.e. drawing N samples

\[
(q^{(i)}, \mathcal{A}^{(i)}, \theta^{(i)}) \sim P(q, \mathcal{A}, \theta \,|\, y),
\]

and subsequently approximating marginal state probabilities by their frequency in the sample:

\[
P(q_t = s \,|\, y) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(q_t^{(i)} = s\right).
\]

Thus, for each position t, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

\[
\begin{aligned}
\mathcal{A} &\sim P(\mathcal{A} \,|\, q) \propto P(q \,|\, \mathcal{A}) \, P(\mathcal{A} \,|\, \pi_{\mathcal{A}}), \\
\theta &\sim P(\theta \,|\, q, y) \propto P(q, y \,|\, \theta) \, P(\theta \,|\, \pi_{\theta}), \text{ and} \\
q &\sim P(q \,|\, \mathcal{A}, y, \theta) \propto P(\mathcal{A}, y, \theta \,|\, q) \, P(q \,|\, \pi_q),
\end{aligned}
\]

where π represents hyperparameters of the prior distributions. Notice that q depends on a prior only in q_0, so π_q is a set of parameters from which the distribution of q_0 is sampled (typically, π_q are the parameters of a Dirichlet distribution); in non-Bayesian settings, π often denotes P(q_0) itself. Typically, each prior will be conjugate, i.e. it will be of the same class of distributions as the posterior. Thus π_S and π_{A(k)}, the hyperparameters of q_0 and of the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ_0) will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG, by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B = (B_w)_{w=1}^{W} be a partition of y into W blocks, where each block B_w contains n_w elements, and let y_{w,k} denote the k-th element of B_w.


The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transition into state j, and the n_w − 1 self-transitions, which yields

\[
\alpha_w(j) = A_{jj}^{\,n_w - 1} \, P(B_w \,|\, \mu_j, \sigma_j^2) \sum_{i} \alpha_{w-1}(i) \, A_{ij}, \quad \text{and} \quad
P(B_w \,|\, \mu, \sigma^2) = \prod_{k=1}^{n_w} P(y_{w,k} \,|\, \mu, \sigma^2).
\]

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can easily be obtained.
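To make the block-wise recursion concrete, here is a minimal Python sketch (the function name, toy blocks and parameter values are ours, not from the HaMMLET code); each block enters only through its sufficient statistics (n_w, Σ₁, Σ₂):

```python
import math

def block_forward(blocks, A, mu, sigma2, alpha0):
    """Forward pass over blocks; each block is (n_w, sum_y, sum_y_squared).

    A block of n_w observations contributes n_w - 1 self-transitions and a
    joint Gaussian emission likelihood, as in the recursion above. This is
    a plain-probability sketch; the real implementation works with rescaled
    exponents instead (see the Numerical issues section).
    """
    S = len(mu)
    alpha = alpha0[:]                       # alpha_0(i)
    for (n, s1, s2) in blocks:
        new = []
        for j in range(S):
            # log P(B_w | mu_j, sigma_j^2) from the sufficient statistics
            ll = -(s2 - 2 * mu[j] * s1 + n * mu[j] ** 2) / (2 * sigma2[j]) \
                 - 0.5 * n * math.log(2 * math.pi * sigma2[j])
            trans = sum(alpha[i] * A[i][j] for i in range(S))
            new.append(math.exp(ll) * A[j][j] ** (n - 1) * trans)
        z = sum(new)                        # normalize, as in a scaled forward pass
        alpha = [a / z for a in new]
    return alpha
```

Normalizing after every block plays the role of the usual scaling step in a forward pass over individual observations.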

Wavelet theory preliminaries

Here, we review some essential wavelet theory; for details, see [74, 75]. Let

\[
\psi(x) =
\begin{cases}
1 & 0 \le x < \frac{1}{2}, \\
-1 & \frac{1}{2} \le x < 1, \\
0 & \text{elsewhere}
\end{cases}
\]

be the Haar wavelet [76], and ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k); j and k are called the scale and shift parameter. Any square-integrable function over the unit interval, f ∈ L²([0,1)), can be approximated using the orthonormal basis {ψ_{j,k} | j, k ∈ ℤ, −1 ≤ j, 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x), which is spatially localized, i.e. non-zero only on a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data y = (y_0, ..., y_{T−1}) can be treated as an equidistant sample of f(x), by scaling the indices to the unit interval using x_t = t/T. Let h = log₂ T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j ≤ h − 1):

\[
y_t = \sum_{j,k} d_{j,k} \, \psi_{j,k}(x_t).
\]

The wavelet transform d = 𝒲y is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Notably, d can be computed in linear time using the pyramid algorithm [77].
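The pyramid algorithm can be sketched in a few lines of Python (an illustrative implementation of ours, not the paper's): each pass replaces the current smooth coefficients by pairwise scaled sums, emitting pairwise scaled differences as detail coefficients, so the total work is T/2 + T/4 + ... = O(T):

```python
import math

def haar_pyramid(y):
    """Pyramid algorithm: O(T) orthonormal Haar transform.

    Returns (approx, details), where details[l] holds the detail
    coefficients of level l, finest level first; T must be a power of 2.
    """
    s = list(y)
    details = []
    while len(s) > 1:
        nxt, det = [], []
        for a, b in zip(s[0::2], s[1::2]):
            nxt.append((a + b) / math.sqrt(2))   # smooth coefficients
            det.append((a - b) / math.sqrt(2))   # detail coefficients
        details.append(det)
        s = nxt
    return s[0], details
```

Since the transform is orthonormal, the energy Σ y_t² is preserved across the coefficients, which is a convenient correctness check.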

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̄ be a vector obtained by setting elements of d to zero; then ȳ = 𝒲^⊤ d̄ is called a selective wavelet reconstruction (SW). If all coefficients d_{j,k} with ψ_{j,k}(x_t) ≠ ψ_{j,k}(x_{t+1}) are set to zero for some t, then ȳ_t = ȳ_{t+1}, which implies a block structure on ȳ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that ȳ does not have to be computed explicitly; the block boundaries can be inferred from the positions of the zero-entries in d̄ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem: finding an estimate μ̂ of a piecewise constant function μ whose range is precisely the set of emission means, i.e.

\[
\mu = f(x), \qquad y = f(x) + \epsilon, \qquad \epsilon \overset{iid}{\sim} N(0, \sigma^2).
\]

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ŷ ≠ μ. However, if ŷ is piecewise constant and minimizes ‖μ − ŷ‖, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding state, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints; ŷ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle Δ(μ, y) that knew the true μ, there exists a method M_SW(y, Δ) = 𝒲^⊤SW, using an optimal subset of wavelet coefficients provided by Δ, such that the quadratic risk of ȳ_SW = 𝒲^⊤SW d is bounded as

\[
\lVert \mu - \bar{y}_{SW} \rVert_2^2 = O\!\left( \frac{\sigma^2 \ln T}{T} \right).
\]

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M_WCT(y, λ, σ), called wavelet coefficient thresholding, which sets those coefficients to zero whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T), with ȳ_WCT = 𝒲^⊤WCT d, such that

\[
\lVert \mu - \bar{y}_{WCT} \rVert_2^2 \le (2 \ln T + 1) \left( \lVert \mu - \bar{y}_{SW} \rVert_2^2 + \frac{\sigma^2}{T} \right).
\]

This λ_T is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².
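The following Python sketch (ours; the actual method uses the wavelet tree described below rather than an explicit transform per iteration) illustrates how universal thresholding induces a block structure: a detail coefficient surviving the threshold marks block boundaries at the edges and midpoint of its support, and all other coefficients are treated as zero:

```python
import math

def wavelet_blocks(y, sigma):
    """Universal-threshold Haar shrinkage; block boundaries are read off
    the surviving detail coefficients (T must be a power of 2).

    Returns a list of half-open [start, end) index intervals.
    """
    T = len(y)
    lam = sigma * math.sqrt(2 * math.log(T))  # universal threshold
    cut = {0, T}                              # block boundaries
    s, span = list(y), 1
    while len(s) > 1:                         # pyramid pass, finest level first
        span *= 2                             # support size of this level
        det = [(a - b) / math.sqrt(2) for a, b in zip(s[0::2], s[1::2])]
        s = [(a + b) / math.sqrt(2) for a, b in zip(s[0::2], s[1::2])]
        for k, d in enumerate(det):
            if abs(d) > lam:                  # coefficient survives shrinkage
                # discontinuities of psi_{j,k}: support edges and midpoint
                cut.update((k * span, k * span + span // 2, (k + 1) * span))
    b = sorted(cut)
    return list(zip(b[:-1], b[1:]))
```

On a step function with well-separated means, only the coefficient straddling the jump survives, so exactly two blocks are produced.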

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ȳ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides a priori samples in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

1: procedure HaMMLET(y, π_A, π_θ, π_q)
2:   T ← |y|
3:   λ ← √(2 ln T)
4:   A ∼ P(A | π_A)
5:   θ ∼ P(θ | π_θ)
6:   for i = 1, ..., N do
7:     σ_min ← min({σ_MAD} ∪ {σ_i | σ_i² ∈ θ})
8:     Create block sequence B from threshold λσ_min
9:     q ∼ P(A, B, θ | q) P(q | π_q)
10:    A ∼ P(q | A) P(A | π_A)
11:    θ ∼ P(q, B | θ) P(θ | π_θ)
12:    Add count of marginal states for q to result
13:  end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: in this case, blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method for estimating the variance of constant noise in shrinkage applications,

\[
\sigma_{MAD}^2 = \left( \frac{\operatorname{MAD}_k \, d_{\log_2 T - 1,\, k}}{\Phi^{-1}\!\left(\frac{3}{4}\right)} \right)^{2},
\]

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min({σ_MAD} ∪ {σ_i | σ_i² ∈ θ}). If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
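As an illustration, σ²_MAD can be computed from the finest-level detail coefficients alone; this little Python helper (our naming) uses the common shrinkage convention of taking the median absolute detail coefficient as the MAD, since the details have median approximately zero under i.i.d. noise:

```python
import math
from statistics import median

def sigma2_mad(y):
    """Noise-variance estimate from the finest-level Haar details:
    sigma_MAD^2 = (MAD_k d_{h-1,k} / Phi^{-1}(3/4))^2, with
    Phi^{-1}(3/4) ~ 0.67449 and MAD taken as median(|d_k|)."""
    d = [(a - b) / math.sqrt(2) for a, b in zip(y[0::2], y[1::2])]
    return (median(abs(x) for x in d) / 0.67449) ** 2
```

Because only pairwise differences enter, the estimate is unaffected by the piecewise constant signal except at the (few) pairs straddling a breakpoint, which is what makes it robust in this setting.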

A data structure for dynamic compression

The necessity to recreate a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead; specifically, without repeatedly computing the inverse wavelet transform, or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log₂ T and ℓ = h − j. We can map the wavelets ψ_{j,k} to a perfect binary tree of height h, such that all wavelets of scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ increases from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t with t in the interval I_{j,k} = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{j,k}, with ℓ = h − j and t = k2^ℓ (ψ_{−1,0} is simply constant on the [0,1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block obtained when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

- The number of leaves, corresponding to the block size:
\[
N_{\ell,t}[n] = 2^{\ell}.
\]

- The sum of the data points stored in the subtree leaves:
\[
N_{\ell,t}[\Sigma_1] = \sum_{i \in I_{j,k}} y_i.
\]

- Similarly, the sum of squares:
\[
N_{\ell,t}[\Sigma_2] = \sum_{i \in I_{j,k}} y_i^2.
\]

- The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself:
\[
N_{0,t}[d] = 0, \qquad N_{\ell>0,t}[d] = \max\left\{ \left| d_{h-\ell',\,k'} \right| \;:\; \ell' \le \ell,\ t \le k' 2^{\ell'} < t + 2^{\ell} \right\}.
\]

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.
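A level-by-level construction of these node records might look as follows in Python (a simplified sketch of ours, using one list per level instead of the flat BFS/DFS array, and omitting noise control):

```python
import math

def build_wavelet_tree(y):
    """Build node records (n, sum, sum_sq, max_abs_coeff) bottom-up.

    Level 0 holds the leaves (coefficient maximum set to 0, as for
    N_{0,t}[d]); each internal node combines its two children and its
    own detail coefficient, as described for N_{l,t}. Returns the list
    of levels, leaf level first."""
    levels = [[(1, v, v * v, 0.0) for v in y]]          # leaves
    s = list(y)                                          # smooth coefficients
    while len(s) > 1:
        prev, nodes, nxt = levels[-1], [], []
        for k, (a, b) in enumerate(zip(s[0::2], s[1::2])):
            d = (a - b) / math.sqrt(2)                   # this node's coefficient
            L, R = prev[2 * k], prev[2 * k + 1]
            nodes.append((L[0] + R[0],                   # block size
                          L[1] + R[1],                   # sum of leaf values
                          L[2] + R[2],                   # sum of squares
                          max(abs(d), L[3], R[3])))      # subtree max |d|
            nxt.append((a + b) / math.sqrt(2))
        levels.append(nodes)
        s = nxt
    return levels
```

A single bottom-up pass touches every node once, matching the linear-time claim; the root record directly yields the sample mean and variance of the whole data set.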

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order, by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect, binary and complete, the next array position in the DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
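The traversal itself reduces to a single while-loop over the DFS-ordered array; this Python sketch (ours) keeps only the fields needed for pruning, namely the leaf count n and the subtree maximum:

```python
def create_blocks(dfs, threshold):
    """Linear scan over a DFS-ordered node array.

    dfs[i] = (n, max_abs_coeff). When a node's subtree maximum is at or
    below the threshold, a block of its n leaves is emitted and the whole
    subtree (2n - 1 nodes) is skipped via i + 2n - 1; otherwise the scan
    descends by one position."""
    blocks, i = [], 0
    while i < len(dfs):
        n, max_d = dfs[i]
        if max_d <= threshold or n == 1:   # prune: subtree becomes one block
            blocks.append(n)
            i += 2 * n - 1                 # jump over the subtree
        else:
            i += 1                         # descend into the subtree
    return blocks
```

Leaves carry a coefficient maximum of 0 (the `n == 1` case), so the loop always terminates with the emitted block sizes summing to T.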

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other, which occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check whether the current variance would create the same number of blocks as in the previous iteration, which guarantees that we obtain the same block sequence, and hence can avoid recomputation.

The wavelet tree data structure can readily be extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we must traverse deeper into the combined tree whenever the coefficient of any of the m trees exceeds the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees. This means that block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not into the queries during the sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists, or when large variation is expected, such as the lab and batch effects commonly observed in microarrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the following method to specify the hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ∼ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:

\[
\sigma^2 \sim \mathrm{I}\Gamma(\alpha, \beta), \qquad \mu \sim N\!\left(\mu_0, \frac{\sigma^2}{\nu}\right).
\]

Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus, the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus, the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels, and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.

Let s² be a user-defined variance (or an automatically inferred one, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and let p be the desired probability of sampling a variance not larger than s². From the CDF of IΓ, we obtain

\[
p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right).
\]

IΓ has a mean for α > 1, and closed-form solutions for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

\[
\beta = -s^2 \left( W_{-1}\!\left( -\frac{p}{e} \right) + 1 \right), \qquad 0 < p \le 1,
\]

where W₋₁ is the negative branch of the Lambert W function, which is transcendental. However, an excellent analytical approximation, with a maximum error of 0.025%, is given in [87], which yields

\[
\beta \approx s^2 \left( b + \frac{\sqrt{2b}}{M_1 \sqrt{b/2} + M_2\, b\, \exp\!\left( M_3 \sqrt{b} \right) + 1} \right), \qquad b = -\ln p,
\]
\[
M_1 = 0.3361, \qquad M_2 = -0.0042, \qquad M_3 = -0.0201.
\]

Since the mean of IΓ is βαminus1 the expected variance of micro is βν

for α= 2 To ensure proper mixing we could simply set βmicroto the sample variance of the data which can be estimatedfrom the sufficient statistics in the root of the wavelet tree(the first entry in the array) provided that micro contained allstates in almost equal number However due to possibleclass imbalance means for short segments far away from micro0can have low sampling probability as they do not contributemuch to the sample variance of the data We thus define δ tobe the sample variance of block means in the compressionobtained by σ2

MAD and take the maximum of those twovariances We thus obtain

micro0 =Σ1

n and ν= βmax

nΣ2 minusΣ21

n2δ

minus1
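As a concrete sketch, the prior computation above can be written in a few lines (Python used for illustration; the function names are ours, not HaMMLET's API):

```python
import math

# Constants of the Lambert-W approximation from [87]
M1, M2, M3 = 0.3361, -0.0042, -0.0201

def variance_prior(s2, p):
    """Hyperparameters (alpha, beta) of the Inverse Gamma variance prior
    such that P(sigma^2 <= s2) is approximately p, with alpha fixed to 2."""
    b = -math.log(p)
    sqb = math.sqrt(b)
    beta = s2 * (2 * sqb / (M1 * sqb + math.sqrt(2) * (M2 * b * math.exp(M3 * sqb) + 1)) + b)
    return 2.0, beta

def mean_prior(sum1, sum2, n, delta, beta):
    """mu_0 and nu from the data's sufficient statistics (Sigma_1, Sigma_2, n)
    and the block-mean variance delta of the sigma^2_MAD compression."""
    mu0 = sum1 / n
    sample_var = (n * sum2 - sum1 ** 2) / n ** 2
    nu = beta / max(sample_var, delta)
    return mu0, nu
```

For α = 2 the CDF has the closed form Q(2, β/s²) = (1 + β/s²) e^(−β/s²), which lets one check the approximation directly; for p = 0.9 and s² = 1 it returns β ≈ 0.53.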

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within a block can be written as

\[ \frac{A_{jj}^{n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right). \]

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much stabler logarithm function:

\[ \exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1)\log A_{jj} - n_w \log \sigma_j \right). \]

Using the block's sufficient statistics

\[ n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2, \]

the exponent can be rewritten as

\[ E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j), \qquad K(n_w, j) = (n_w - 1)\log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right). \]

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

\[ \exp\left( \max_k E_w(k) \right), \]

which cancels out during normalization. Hence we obtain

\[ \hat{\alpha}_w(j) = \exp\left( E_w(j) - \max_k E_w(k) \right) \sum_{i} \alpha_{w-1}(i)\, A_{ij}, \]

which are then normalized to

\[ \alpha_w(j) = \frac{\hat{\alpha}_w(j)}{\sum_k \hat{\alpha}_w(k)}. \]
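As a sketch in Python (not the optimized C++ implementation; the state parameters below are purely illustrative), one compressed forward step then reads:

```python
import numpy as np

def block_exponents(y, mu, sigma, A_diag):
    """E_w(j) for all states j, from the block's sufficient statistics."""
    n_w = len(y)
    s1, s2 = y.sum(), (y ** 2).sum()
    # K(n_w, j) depends only on n_w and the state, so it can be
    # precomputed once per iteration for every occurring block size
    K = (n_w - 1) * np.log(A_diag) - n_w * (np.log(sigma) + mu ** 2 / (2 * sigma ** 2))
    return (2 * mu * s1 - s2) / (2 * sigma ** 2) + K

def forward_block(alpha_prev, y, mu, sigma, A):
    """One compressed forward step: a transition step into the block, then
    emissions and self-transitions within it; subtracting the maximum
    exponent guards against overflow and cancels under normalization."""
    E = block_exponents(y, mu, sigma, np.diag(A))
    w = np.exp(E - E.max()) * (alpha_prev @ A)
    return w / w.sum()
```

Since E_w(j) drops only state-independent constants, the result agrees with a naive forward step that multiplies the n_w individual Gaussian densities and n_w − 1 self-transitions per state.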

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W., Lee, C.: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk, L., Marshall, C.R., Wintle, R.F., Scherer, S.W.: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll, S.A., Altshuler, D.M.: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain, L.V., Armour, J.A.L., Tobin, M.D.: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann, J.S., Sharp, A.J., Antonarakis, S.E.: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook, E.H., Scherer, S.W.: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic, A., Tomljanovic, D.: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas, A.K., Corvin, A.P., Gallagher, L.: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho, E.K., Tchinda, J., Freeman, J.L., Chung, Y.-J., Cai, W.W., Lee, C.: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien, A., Malkin, D.: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk, L., Carson, A.R., Scherer, S.W.: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann, J.S., Estivill, X., Antonarakis, S.E.: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp, A.J.: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis, C., Coe, B.P., Zhang, L., Rosin, M.P., Lam, W.L.: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson, D.G., Ylstra, B., Segraves, R., Collins, C., Dairkee, S.H., Kowbel, D., Kuo, W.L., Gray, J.W., Pinkel, D.: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman, J.A., Fridlyand, J., Pejavar, S., Olshen, A.B., Korkola, J.E., DeVries, S., Carroll, P., Kuo, W.-L., Pinkel, D., Albertson, D.G., Cordon-Cardo, C., Jain, A.N., Waldman, F.M.: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio, R., Hautaniemi, S., Kauraniemi, P., Yli-Harja, O., Astola, J., Wolf, M., Kallioniemi, A.: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack, J.R., Sørlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lonning, P.E., Tibshirani, R., Botstein, D., Børresen-Dale, A.-L., Brown, P.O.: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers, P.H.C., de Menezes, R.X.: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland, W.S.: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti, B., Braude, I., Marrano, P., Thorner, P., Zielenska, M., Squire, J.A.: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl, J., Spokoiny, V.G.: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé, P., Stransky, N., Thiery, J.-P., Radvanyi, F., Barillot, E.: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang, Y., Wang, S.: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu, L., Self, S.G., Grove, D., Randolph, T., Wang, K., Delrow, J.J., Loo, L., Porter, P.: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang, H., Nguyen, N., Oraintara, S., Vo, A.: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt, C., Losic, B., Pai, D., Zhao, Z., Trinh, Q., Syam, S., Arshadi, N., Jang, G.H., Ali, J., Beck, T., McPherson, J., Muthuswamy, L.B.: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price, T.S., Regan, R., Mott, R., Hedman, A., Honey, B., Daniels, R.J., Smith, L., Greenfield, A., Tiganescu, A., Buckle, V., Ventress, N., Ayyub, H., Salhan, A., Pedraza-Diaz, S., Broxholme, J., Ragoussis, J., Higgs, D.R., Flint, J., Knight, S.J.L.: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis, C.E., Peng, R., Tsiarli, M.A., Miller, G.L., Schwartz, R.: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16 (2011). doi:10.1145/1963190.2063517

[32] Olshen, A.B., Venkatraman, E.S.: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman, E.S., Olshen, A.B.: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard, F., Robin, S., Lavielle, M., Vaisse, C., Daudin, J.-J.: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers, C.L., Dunham, M.J., Kung, S.Y., Troyanskaya, O.G.: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang, P., Kim, Y., Pollack, J., Narasimhan, B., Tibshirani, R.: A method for calling gains and losses in array CGH data. Biostatistics 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing, B., Greenwood, C.M.T., Bull, S.B.: A hierarchical clustering method for estimating copy number variation. Biostatistics 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen, C.-H., Lee, H.-C.C., Ling, Q., Chen, H.-R., Ko, Y.-A., Tsou, T.-S., Wang, S.-C., Wu, L.-C., Lee, H.-C.C.: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), e89 (2011). doi:10.1093/nar/gkr137


[40] Jong, K., Marchiori, E., van der Vaart, A., Ylstra, B., Weiss, M., Meijer, G.: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum, L.E., Petrie, T.: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147

[42] Snijders, A.M., Fridlyand, J., Mans, D.A., Segraves, R., Jain, A.N., Pinkel, D., Albertson, D.G.: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Månér, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T.C., Trask, B., Patterson, N., Zetterberg, A., Wigler, M.: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat, J., Lakshmi, B., Malhotra, D., Troge, J., Lese-Martin, C., Walsh, T., Yamrom, B., Yoon, S., Krasnitz, A., Kendall, J., Leotta, A., Pai, D., Zhang, R., Lee, Y.-H., Hicks, J., Spence, S.J., Lee, A.T., Puura, K., Lehtimäki, T., Ledbetter, D., Gregersen, P.K., Bregman, J., Sutcliffe, J.S., Jobanputra, V., Chung, W., Warburton, D., King, M.-C., Skuse, D., Geschwind, D.H., Gilliam, T.C., Ye, K., Wigler, M.: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand, J., Snijders, A.M., Pinkel, D., Albertson, D.G., Jain, A.N.: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao, X.: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries, B.B.A., Pfundt, R., Leisink, M., Koolen, D.A., Vissers, L.E.L.M., Janssen, I.M., van Reijmersdal, S., Nillesen, W.M., Huys, E.H.L.P.G., de Leeuw, N., Smeets, D., Sistermans, E.A., Feuth, T., van Ravenswaaij-Arts, C.M.A., van Kessel, A.G., Schoenmakers, E.F.P.M., Brunner, H.G., Veltman, J.A.: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya, Y., Sanada, M., Nakazaki, K., Hosoya, N., Wang, L., Hangaishi, A., Kurokawa, M., Chiba, S., Bailey, D.K., Kennedy, G.C., Ogawa, S.: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni, J.C., Thorne, N.P., Tavaré, S.: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel, J.O., Urban, A.E., Grubert, F., Du, J., Royce, T.E., Starr, P., Zhong, G., Emanuel, B.S., Weissman, S.M., Snyder, M., Gerstein, M.B.: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan, P., Godfrey, L.E., Eis, P.S., Richmond, T.A., Selzer, R.R., Brent, M., McLeod, H.L., Ley, T.J., Graubert, T.A.: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), e41 (2008). doi:10.1093/nar/gkn110

[52] Rueda, O.M., Diaz-Uriarte, R.: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes, J.: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney, G.D.: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib, S.: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott, S.L.: Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha, S., Li, Y., Neuberg, D.: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah, S.P., Xuan, X., DeLeeuw, R.J., Khojasteh, M., Lam, W.L., Ng, R., Murphy, K.P.: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), e431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah, S.P., Lam, W.L., Ng, R.T., Murphy, K.P.: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), i450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud, M.P., Schliep, A.: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov, E., Eldar, Y.C.: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), i139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang, J., Meza-Zepeda, L.A., Kresse, S.H., Myklebost, O.: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai, W.R., Johnson, M.D., Kucherlapati, R., Park, P.J.: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock, H., Fridlyand, J.: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson, G., Hager, J.H., Volik, S., Hariono, S., Wernick, M., Moore, D., Nowak, N., Albertson, D.G., Pinkel, D., Collins, C., Hanahan, D., Gray, J.W.: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren, H., Murumagi, A., Kangaspeska, S., Nicorici, D., Hongisto, V., Kleivi, K., Rye, I.H., Nyberg, S., Wolf, M., Børresen-Dale, A.-L., Kallioniemi, O.: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall, S., Hanby, A., Lansdown, M., Speirs, V.: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday, D.L., Speirs, V.: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür, A., Özgür, L., Güngör, T.: Text categorization with class-based and corpus-based keyword selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders, A.M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., Hamilton, G., Hindle, A.K., Huey, B., Kimura, K., Law, S., Myambo, K., Palmer, J., Ylstra, B., Yue, J.P., Gray, J.W., Jain, A.N., Pinkel, D., Albertson, D.G.: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud, M.P., Schliep, A.: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies, I.: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat, S.G.: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar, A.: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat, S.G.: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho, D.L., Johnstone, I.M.: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho, D.L., Johnstone, I.M.: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho, D.L., Johnstone, I.M.: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., Picard, D.: Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival, D.B., Mofjeld, H.O.: Analysis of subtidal coastal sea level fluctuations using wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh, A., Walden, A.T., Percival, D.B.: Statistical properties and uses of the wavelet variance estimator for the scale analysis of time series. Journal of the American Statistical Association (2000). doi:10.1080/01621459.2000.10473913

[86] Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., Shi, T., Tong, W., Shi, L., Hong, H., Zhao, C., Elloumi, F., Shi, W., Thomas, R., Lin, S., Tillinghast, G., Liu, G., Zhou, Y., Herman, D., Li, Y., Deng, Y., Fang, H., Bushel, P., Woods, M., Zhang, J.: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry, D.A., Parlange, J.-Y., Li, L., Prommer, H., Cunningham, C.J., Stagnitti, F.: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5


CN states. The clustering heuristic relies on empirically derived parameters not supported by a theoretical analysis, which can lead to suboptimal clustering or overfitting. Also, the method cannot easily be generalized to multivariate data. Lastly, the implementation was primarily aimed at a comparative analysis between the frequentist and Bayesian approaches, as opposed to overall speed.

To address these issues and make Bayesian CNV inference feasible even on a laptop, we propose the combination of HMMs with another popular signal processing technology. Haar wavelets have previously been used in CNV detection [63], mostly as a preprocessing tool for statistical downstream applications [24–28] or simply as a visual aid in GUI applications [17, 64]. A combination of smoothing and segmentation has been suggested as likely to improve results [65], and here we show that this is indeed the case. The wavelets provide a theoretical foundation for a better dynamic compression scheme, for faster convergence and accelerated Bayesian sampling. We improve simultaneously upon the typically conflicting goals of accuracy and speed, because the wavelets allow summary treatment of "easy" CN calls in segments and focus computational effort on the "difficult" CN calls, dynamically and adaptively. This is in contrast to other computationally efficient tools, which often simplify the statistical model or use heuristics. The required data structure can be computed efficiently, incurs minimal overhead, and has a straightforward generalization to multivariate data. We further show how the wavelet transform yields a natural way to set hyperparameters automatically, with little input from the user.

We implemented our method in a highly optimized end-user software called HaMMLET. Aside from achieving an acceleration of up to two orders of magnitude, it exhibits significantly improved convergence behavior, has excellent precision and recall, and provides Bayesian inference within seconds, even for large data sets. The accuracy and speed of HaMMLET also make it an excellent choice for routine diagnostic use and large-scale re-analysis of legacy data.

Results and discussion

Simulated aCGH data

A previous survey [65] of eleven CNV calling methods for aCGH has established that segmentation-focused methods such as DNAcopy [32, 33], an implementation of circular binary segmentation (CBS), as well as CGHseg [35], perform consistently well. DNAcopy performs a number of t-tests to detect breakpoint candidates. The result is typically over-segmented and requires a merging step in post-processing, especially to reduce the number of segment means. To this end, MergeLevels was introduced by [66]. They compare the combination DNAcopy+MergeLevels to their own HMM implementation [45] as well as GLAD [23], showing its superior performance over both methods. This established DNAcopy+MergeLevels as the de facto standard in CNV

Figure 2: F-measures of CBS (light) and HaMMLET (dark) for calling aberrant copy numbers on simulated aCGH data, grouped by aberration size (<5, 5–10, 10–20, >20) for the three simulation sets BD&M, SRS and T. Boxes represent the interquartile range (IQR = Q3 − Q1), with a horizontal line showing the median (Q2), whiskers representing the range (3/2 IQR beyond Q1 and Q3), and the bullet representing the mean. HaMMLET has the same or better F-measures in most cases and converges to 1 for larger segments, whereas CBS plateaus for aberrations greater than 10.

detection, despite the comparatively long running time.

The paper also includes aCGH simulations deemed to be reasonably realistic by the community. DNAcopy was used to segment 145 unpublished samples of breast cancer data, which were subsequently labeled as copy numbers 0 to 5 by sorting them into bins with boundaries (−∞, −0.4, −0.2, 0.2, 0.4, 0.6, ∞), based on the sample mean in each segment (the last bin appears to not be used). Empirical length distributions were derived, from which the sizes of CN aberrations are drawn. The data itself is modeled to include Gaussian noise, which has been established as sufficient for aCGH data [67]. Means were generated so as to mimic random tumor cell proportions, and random variances were chosen to simulate experimenter bias often observed in real data; this emphasizes the importance of having automatic priors available when using Bayesian methods, as the means and variances might be unknown a priori. The simulations are available at http://www.cbs.dtu.dk/~hanni/aCGH/. They comprise three sets of simulations: "breakpoint detection and merging" (BD&M), "spatial resolution study" (SRS) and "testing" (T) (see their paper for details). We used the MergeLevels implementation as provided on their website. It should be noted that the superiority of DNAcopy+MergeLevels was established using a simulation based upon segmentation results of DNAcopy itself.
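The labeling step described above can be sketched with a small binning helper (ours, for illustration; boundary ties fall into the higher bin here, since the protocol does not specify them):

```python
import bisect

# Bin boundaries from the simulation protocol; a segment's sample mean
# is mapped to a copy number 0..5 (the last bin appears unused).
BOUNDARIES = [-0.4, -0.2, 0.2, 0.4, 0.6]

def copy_number(segment_mean):
    """Copy number label for a segment, from its sample mean."""
    return bisect.bisect_right(BOUNDARIES, segment_mean)
```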

We used the Bioconductor package DNAcopy (version 1.24.0) and followed the procedure suggested therein, including outlier smoothing. This version uses the linear-time variety of CBS [34]; note that other authors, such as [31], compare against its quadratic-time version [33], which is

3

CC-BY-ND 40 International licensecertified by peer review) is the authorfunder It is made available under aThe copyright holder for this preprint (which was notthis version posted July 31 2015 httpsdoiorg101101023705doi bioRxiv preprint

likely to yield a favorable comparison and speedups of several orders of magnitude, especially on large data. For HaMMLET, we use a 5-state model with automatic hyperparameters P(σ² ≤ 0.01) = 0.9 (see section Automatic priors), and all Dirichlet hyperparameters set to 1.

We report F-measures for detecting aberrant segments (Fig. 2). On datasets T and BD&M, both methods have similar medians, but HaMMLET has a much better interquartile range (IQR) and range, about half of CBS's. On the spatial resolution data set (SRS), HaMMLET performs much better on very small aberrations. This might seem somewhat surprising, as short segments could easily get lost under compression. However, Lai et al. [65] have noted that smoothing-based methods such as quantile smoothing (quantreg) [19], lowess [20] and wavelet smoothing [25] perform particularly well in the presence of high noise and small CN aberrations, suggesting that "an optimal combination of the smoothing step and the segmentation step may result in improved performance". Our wavelet-based compression inherits those properties. For CNVs of sizes between 5 and 10, CBS and HaMMLET have similar ranges, with CBS being skewed towards better values. CBS has a slightly higher median for 10–20, with IQR and range being about the same. However, while HaMMLET's F-measure consistently approaches 1 for larger aberrations, CBS does not appear to improve significantly after size 10.

High-density CGH array

In this section, we demonstrate HaMMLET's performance on biological data. Due to the lack of a gold standard for high-resolution platforms, we assess the CNV calls qualitatively. We use raw aCGH data (GEO: GSE23949) [68] of genomic DNA from breast cancer cell line BT-474 (invasive ductal carcinoma, GEO: GSM590105) on an Agilent-021529 Human CGH Whole Genome Microarray 1x1M platform (GEO: GPL8736). We excluded gonosomes, mitochondrial and random chromosomes from the data, leaving 966,432 probes in total.

HaMMLET allows for using automatic emission priors (see section Automatic priors) by specifying a noise variance and a probability to sample a variance not exceeding this value. We compare HaMMLET's performance against CBS, using a 20-state model with automatic priors P(σ² ≤ 0.1) = 0.8, 10 prior self-transitions, and 1 for all other hyperparameters. CBS took over 2 h 9 m to process the entire array, whereas HaMMLET took 27.1 s for 100 iterations, a speedup of 288. The compression ratio (see section Speed and convergence effects of wavelet compression) was 220.3. CBS yielded a massive over-segmentation into 1,548 different copy number levels. As the data is derived from a relatively homogeneous cell line, as opposed to a biopsy, we do not expect the presence of subclonal populations to be a contributing factor [69, 70]. Instead, measurements on aCGH are known to be spatially correlated, resulting in a wave pattern which has to be removed in a preprocessing step. CBS performs

such a smoothing, yet the unrealistically large number of different levels is likely due to residuals of said wave pattern. Repeated runs of CBS yielded different numbers of levels, suggesting that the merging was indeed incomplete. This further highlights the importance of prior knowledge in statistical inference. By limiting the number of CN states a priori, non-i.i.d. effects are easily remedied, and inference can be performed without preprocessing. Furthermore, the internal compression mechanism of HaMMLET is derived from a spatially adaptive regression method, so smoothing is inherent to our method. The results are included in the supplement.

For a more comprehensive analysis, we restricted our evaluation to chromosome 20 (21,687 probes), which we assessed to be the most complicated to infer, as it appears to have the highest number of different CN states and breakpoints. CBS yields a 19-state result after 157.8 s (Fig. 3, top). We then used a 19-state model with automatic priors (P(σ² ≤ 0.04) = 0.9, 10 prior self-transitions, all other Dirichlet parameters set to 1) to reproduce this result. Using noise control (see Methods), our method took 16.1 s for 600 iterations. The solution we obtained is consistent with CBS (Fig. 3, middle and bottom). However, only 11 states were part of the final solution, i.e. 6 states yielded no significant likelihood above that of other states. We observe superfluous states being ignored in our simulations as well (cf. Supplement). In light of the results on the entire array, we suggest that the segmentation by DNAcopy has not been sufficiently merged by MergeLevels. Most strikingly, HaMMLET does not show any marginal support for a segment called by CBS around probe number 4,500. We have confirmed that this is not due to data compression, as the segment is broken up into multiple blocks in each iteration (data not shown). On the other hand, two much smaller segments called by CBS in the 17,000–20,000 range do have marginal support of about 40% in HaMMLET, suggesting that the lack of support for the larger segment is correct. It should be noted that inference differs between the entire array and chromosome 20 in both methods, since long-range effects have a higher impact in larger data.

We also demonstrate another feature of HaMMLET, called noise control. While Gaussian emissions have been deemed a sufficiently accurate noise model for aCGH [67], microarray data is prone to outliers, for example due to damages on the chip. While it is possible to model outliers directly [60], the characteristics of the wavelet transform allow us to largely suppress them during the construction of our data structure (see Methods). Notice that, due to noise control, most outliers are correctly labeled according to the segment they occur in, while the short gain segment close to the beginning is called correctly.


bioRxiv preprint, this version posted July 31, 2015; doi: https://doi.org/10.1101/023705. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

Figure 3: Copy number inference for chromosome 20 in invasive ductal carcinoma (21,687 probes). CBS creates a 19-state solution (top); however, a compressed 19-state HMM only supports an 11-state solution (bottom), suggesting insufficient level merging in CBS.



Speed and convergence effects of wavelet compression

The speedup gained by compression depends on how well the data can be compressed. Poor compression is expected when the means are not well separated, or short segments have small variance, which necessitates the creation of smaller blocks for the rest of the data to expose potential low-variance segments to the sampler. On the other hand, data must not be over-compressed, to avoid merging small aberrations with normal segments, which would decrease the F-measure. The level of compression is measured with the compression ratio. Due to the dynamic changes to the block structure, we define the compression ratio as the product of the number of data points T and the number of iterations N, divided by the total number of blocks in all iterations. As usual, a compression ratio of one indicates no compression.
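The compression ratio defined above can be sketched in a few lines. This is an illustrative helper, not part of HaMMLET; the function name and the example block counts are hypothetical.

```python
# Sketch: compression ratio as defined above, assuming we have recorded
# the number of blocks produced in each sampling iteration.
def compression_ratio(block_counts, T):
    """block_counts: number of blocks in each of the N iterations."""
    N = len(block_counts)
    return (T * N) / sum(block_counts)

# Example: T = 1000 data points, 4 iterations with varying block counts;
# (1000 * 4) / (10 + 8 + 5 + 5) ≈ 142.9.
print(compression_ratio([10, 8, 5, 5], T=1000))
```

A ratio of one is recovered exactly when every iteration uses T blocks, i.e. no compression takes place.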

To evaluate the impact of dynamic wavelet compression on speed and convergence properties of an HMM, we created 129,600 different data sets with T = 32,768 probes each. In each data set, we randomly distributed 1 to 6 gains of a total length of 100, 250, 500, 750 or 1,000 uniformly among the data, and do the same for losses. Mean combinations (μ_loss, μ_neutral, μ_gain) were chosen from (log₂ ½, log₂ 1, log₂ 3∕2), (−1, 0, 1), (−2, 0, 2) and (−10, 0, 10), and variances (σ²_loss, σ²_neutral, σ²_gain) from (0.05, 0.05, 0.05), (0.5, 0.1, 0.9), (0.3, 0.2, 0.1), (0.2, 0.1, 0.3), (0.1, 0.3, 0.2) and (0.1, 0.1, 0.1). These values have been selected to yield a wide range of easy and hard cases, both well-separated, low-variance data with large aberrant segments, as well as cases in which small aberrations overlap significantly with the tail samples of high-variance neutral segments. Consequently, compression ratios range from ∼1 to ∼2,100. We use automatic priors P(σ² ≤ 0.2) = 0.9, self-transition priors α_ii ∈ {10, 100, 1000}, non-self transition priors α_ij = 1, and initial state priors α ∈ {1, 10}. Using all possible combinations of the above yields 129,600 different simulated data sets, a total of ∼4.2 billion values.

We achieve speedups per iteration of up to 350 compared to an uncompressed HMM (Fig. 5). In contrast, [62] have reported ratios of 10–60, with one instance of 90. Notice that the speedup is not linear in the compression ratio: while sampling itself is expected to yield linear speedup, the marginal counts still have to be tallied individually for each position, and dynamic block creation causes some overhead. The quantization artifacts observed for larger speedups are likely due to the limited resolution of the Linux time command (10 milliseconds). Compressed HaMMLET took about 112 CPU hours for all 129,600 simulations, whereas the uncompressed version took over 3 weeks and 5 days. All running times reported are CPU time, measured on a single core of an AMD Opteron™ 6174 processor clocked at 2.2 GHz.

Figure 5: HaMMLET's speedup as a function of the average compression during sampling. As expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact that some overhead is incurred by the dynamic compression, as well as parts of the implementation that do not depend on the compression, such as tallying marginal counts.

We evaluate the convergence of the F-measure of compressed and uncompressed inference for each simulation. Since we are dealing with multi-class classification, we use the micro- and macro-averaged F-measures (F_mi, F_ma) proposed by [71]; they tend to be dominated by the classifier's performance on common and rare categories, respectively. Since all state labels are sampled from the same prior, and hence their relative order is random, we used the label permutation which yielded the highest sum of micro- and macro-averaged F-measures.
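The distinction between the two averages can be sketched as follows; this is a minimal illustration of micro- vs. macro-averaged F-measures for integer state labels, not the evaluation code used in the paper.

```python
import numpy as np

def f_measures(true, pred, n_states):
    """Micro- and macro-averaged F-measures for multi-class labels.
    Minimal sketch; assumes labels are integers in [0, n_states)."""
    tp = np.zeros(n_states); fp = np.zeros(n_states); fn = np.zeros(n_states)
    for s in range(n_states):
        tp[s] = np.sum((pred == s) & (true == s))
        fp[s] = np.sum((pred == s) & (true != s))
        fn[s] = np.sum((pred != s) & (true == s))
    # Micro: pool counts over all classes, dominated by common classes.
    f_mi = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro: average the per-class F-measures, sensitive to rare classes.
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    f_ma = per_class.mean()
    return f_mi, f_ma
```

Predicting everything as the majority class can still yield a high F_mi while F_ma stays low, which is exactly why both averages are reported.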

In Fig. 4, we show that the compressed version of the Gibbs sampler converges almost instantly, whereas the uncompressed version converges much slower, with about 5% of the cases failing to yield an F-measure > 0.6 within 1000 iterations. Wavelet compression is likely to yield reasonably large blocks for the majority class early on, which leads to a strong posterior estimate of its parameters and self-transition probabilities. As expected, F_ma are generally worse, since any misclassification in a rare class has a larger impact. Especially in the uncompressed version, we observe that F_ma tends to plateau until F_mi approaches 1.0. Since any misclassification in the majority (neutral) class adds false positives to the minority classes, this effect is expected. It implies that correct labeling of the majority class is a necessary condition for correct labeling of minority classes; in other words, correct identification of the rare, interesting segments requires the sampler to properly converge, which is much harder to achieve without compression. It should be noted that running compressed HaMMLET for 1000 iterations is unnecessary on the simulated data, as in all cases it converges between 25 and 50 iterations. Thus, for all practical purposes, an additional speedup of 40–80 can be achieved by reducing the number of iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG.



Figure 4: The median value (black) and quantile ranges (in 5% steps) of the micro- (top) and macro-averaged (bottom) F-measures for uncompressed (left) and compressed (right) FBG inference on the same 129,600 simulated data sets, using automatic priors. The x-axis represents the number of iterations alone and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges no later than 50 iterations (inset figures, right).



Coriell, ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples for the Coriell cell line. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated and typically not used in software evaluations. In Fig. 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample of a 54-year-old female. We used 6-state automatic priors with P(σ² ≤ 0.1) = 0.85 and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1000 iterations; running times for the other samples were similar. Our results for all 25 data sets have been included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use (using the software, setting software and method parameters), and often the speed of a method. Bayesian methods have obtained a reputation of requiring enormous computational effort and being difficult to use, due to the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of uncertainty in results. Nevertheless, it is not surprising that a hundred-fold larger computational effort alone prevented wide-spread use.

Inferring copy number variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g. long segments lost on both chromosomal copies). With lesser-quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate and automated methods is evident.

The core idea for our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method are Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation and allow to summarize the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM, by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running times, up to 350-fold compared to the standard FB-Gibbs sampler, and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus the combination of wavelets and HMM realizes a simultaneous improvement on accuracy and on speed; typically, one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of those "easy"-to-call segments, given the current sample of HMM parameters, and identify "hard"-to-call CN segments which need the full computational effort from FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements to accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig. 1. Compared to our prior work, we observe a speed-up of up to 3,000, due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics, and lastly a C++ instead of a Python implementation.

We performed an extensive comparison with the state-of-the-art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and 129,600 simulated data sets, which were produced by a simulation process not based on an HMM to make it a harder problem for our method. All comparisons demonstrated favorable results for our method when measuring accuracy, at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather with an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior, and thus the method, is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them for the evaluation, as they might be perceived as unfair to other methods.

Figure 6: HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations, typically observed for Coriell data.

In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate, and much more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for principled re-analysis of large cohorts, where the added accuracy and the much improved information about uncertainty in copy number calls from posterior marginal distributions will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y), with transition matrix A, a latent state sequence q = (q₀, q₁, …, q_{T−1}), an observed emission sequence y = (y₀, y₁, …, y_{T−1}), and the emission parameters θ. The initial distribution of q₀ is often included and denoted as π, but in a Bayesian setting it makes more sense to take π as a set of hyperparameters for q₀.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters,

$$(A^{ML}, \theta^{ML}) = \underset{(A,\theta)}{\arg\max}\; \mathcal{L}(A, \theta \mid y),$$

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,

$$q^{*} = \underset{q}{\arg\max}\; \mathbb{P}(q \mid A^{ML}, \theta^{ML}, y),$$

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentations [62]. Furthermore, alternative segmentations given those parameters are ignored, as are the ones for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly, by integrating out the emission and transition variables:

$$\mathbb{P}(q \mid y) = \int_{A} \int_{\theta} \mathbb{P}(q, A, \theta \mid y)\, \mathrm{d}\theta\, \mathrm{d}A.$$

Since this integral is intractable, it has to be approximated using Markov chain Monte Carlo techniques, i.e. drawing N samples

$$(q^{(i)}, A^{(i)}, \theta^{(i)}) \sim \mathbb{P}(q, A, \theta \mid y),$$

and subsequently approximating marginal state probabilities by their frequency in the sample:

$$\mathbb{P}(q_t = s \mid y) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\bigl(q_t^{(i)} = s\bigr).$$
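The frequency estimate above can be sketched directly; this is an illustrative helper on hypothetical sampled sequences, not HaMMLET's tallying code.

```python
import numpy as np

# Sketch: approximate posterior marginals P(q_t = s | y) from N sampled
# state sequences, using the frequency estimate above.
def marginal_state_probs(samples, n_states):
    """samples: array of shape (N, T), integer state labels per iteration."""
    N, T = samples.shape
    probs = np.zeros((T, n_states))
    for s in range(n_states):
        probs[:, s] = (samples == s).sum(axis=0) / N
    return probs  # probs[t, s] approximates P(q_t = s | y)

# Four hypothetical sampled sequences of length 3:
samples = np.array([[0, 0, 1],
                    [0, 1, 1],
                    [0, 0, 1],
                    [0, 1, 1]])
print(marginal_state_probs(samples, 2))
```

Each row of the result is a full distribution over states for one position, which is precisely the per-probe output the method reports.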

Thus, for each position t, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

$$A \sim \mathbb{P}(A \mid q) \propto \mathbb{P}(q \mid A)\, \mathbb{P}(A \mid \pi_A),$$
$$\theta \sim \mathbb{P}(\theta \mid q, y) \propto \mathbb{P}(q, y \mid \theta)\, \mathbb{P}(\theta \mid \pi_\theta), \text{ and}$$
$$q \sim \mathbb{P}(q \mid A, y, \theta) \propto \mathbb{P}(A, y, \theta \mid q)\, \mathbb{P}(q \mid \pi_q),$$

where π represents hyperparameters to the prior distributions. Notice that q depends on a prior only in q₀, so π_q is a set of parameters from which the distribution of q₀ is sampled (typically, π_q are parameters of a Dirichlet distribution). In non-Bayesian settings, π often denotes P(q₀) itself. Typically, each prior will be conjugate, i.e. it will be the same class of distributions as the posterior. Thus π_S and π_A(k), the hyperparameters of q₀ and the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ₀) will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG, by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B = (B_w)_{w=1}^{W} be a partition of y into W blocks, where each block B_w contains n_w elements, and let y_{w,k} be the k-th element in B_w.



The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transition into state j, and the n_w − 1 self-transitions, which yields

$$\alpha_w(j) = A_{jj}^{\,n_w - 1}\; \mathbb{P}(B_w \mid \mu_j, \sigma_j^2) \sum_{i} \alpha_{w-1}(i)\, A_{ij}, \quad \text{and}$$

$$\mathbb{P}(B_w \mid \mu, \sigma^2) = \prod_{k=1}^{n_w} \mathbb{P}(y_{w,k} \mid \mu, \sigma^2).$$
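The block-wise recursion above can be sketched as follows. This is a log-space illustration under the stated Gaussian emission model, with each block represented by its sufficient statistics (n, Σy, Σy²); the function names and data layout are illustrative, not HaMMLET's internals.

```python
import numpy as np

# Sketch of the compressed forward recursion: each block B_w contributes
# its n_w emissions via sufficient statistics and n_w - 1 self-transitions.
def block_loglik(n, s1, s2, mu, var):
    """log P(B_w | mu, var) per state, from sufficient statistics (n, Σy, Σy²)."""
    return (-0.5 * n * np.log(2 * np.pi * var)
            - (s2 - 2 * mu * s1 + n * mu ** 2) / (2 * var))

def forward_blocks(blocks, A, mu, var, pi0):
    """blocks: list of (n_w, sum, sumsq); A: transition matrix (rows sum to 1)."""
    log_alpha = None
    for n, s1, s2 in blocks:
        if log_alpha is None:
            trans = np.log(pi0)                  # initial distribution
        else:
            m = log_alpha.max()                  # log-sum-exp over predecessors
            trans = m + np.log(np.exp(log_alpha - m) @ A)
        log_alpha = ((n - 1) * np.log(np.diag(A))
                     + block_loglik(n, s1, s2, mu, var) + trans)
    return log_alpha  # log of alpha_W(j) for the final block
```

The loop runs once per block rather than once per observation, which is where the compression-proportional speedup in the forward pass comes from.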

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can be easily obtained.

Wavelet theory preliminaries

Here, we review some essential wavelet theory; for details, see [74, 75]. Let

$$\psi(x) = \begin{cases} 1 & 0 \le x < \frac{1}{2},\\ -1 & \frac{1}{2} \le x < 1,\\ 0 & \text{elsewhere} \end{cases}$$

be the Haar wavelet [76], and ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k); j and k are called the scale and shift parameter. Any square-integrable function over the unit interval, f ∈ L²([0,1)), can be approximated using the orthonormal basis {ψ_{j,k} | j, k ∈ ℤ; −1 ≤ j; 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x), which is spatially localized, i.e. non-zero on only a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data y = (y₀, …, y_{T−1}) can be treated as an equidistant sample of f(x) by scaling the indices to the unit interval using x_t = t∕T. Let h = log₂ T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j ≤ h − 1):

$$y_t = \sum_{j,k} d_{j,k}\, \psi_{j,k}(x_t).$$

The wavelet transform d = W y is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [77].

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̃ be a vector obtained by setting elements of d to zero; then ỹ = Wᵀd̃ is called a selective wavelet reconstruction (SW). If all coefficients d_{j,k} with ψ_{j,k}(x_t) ≠ ψ_{j,k}(x_{t+1}) are set to zero for some t, then ỹ_t = ỹ_{t+1}, which implies a block structure on ỹ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that ỹ does not have to be computed explicitly: the block boundaries can be inferred from the positions of zero-entries in d̃ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem of finding an estimate μ̂ of a piecewise constant function μ whose range is precisely the set of emission means, i.e.

$$\mu = f(x), \qquad y = f(x) + \epsilon, \qquad \epsilon \overset{\text{iid}}{\sim} N(0, \sigma^2).$$

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ŷ ≠ μ̂. However, if ŷ is piecewise constant and minimizes ‖μ − ŷ‖, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding state, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints. ŷ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle Δ(μ, y) that knew the true μ, then there exists a method M_SW(y, Δ) using an optimal subset of wavelet coefficients provided by Δ, such that the quadratic risk of the reconstruction ŷ_SW is bounded as

$$\lVert \mu - \hat{y}_{SW} \rVert_2^2 = O\!\left(\frac{\sigma^2 \ln T}{T}\right).$$

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M_WCT(y, λ, σ), called wavelet coefficient thresholding, which sets to zero those coefficients whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T) with reconstruction ŷ_WCT such that

$$\lVert \mu - \hat{y}_{WCT} \rVert_2^2 \le (2 \ln T + 1)\left(\lVert \mu - \hat{y}_{SW} \rVert_2^2 + \frac{\sigma^2}{T}\right).$$

This λ_T is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².
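Universal thresholding can be sketched end-to-end as follows. For clarity this sketch performs a full forward and inverse transform, whereas the method described below infers block boundaries from the zeroed coefficients without ever reconstructing; the function name is illustrative.

```python
import numpy as np

# Sketch: hard thresholding of Haar coefficients at the universal
# threshold λ_u = σ * sqrt(2 ln T). Assumes len(y) is a power of two.
def haar_denoise(y, sigma):
    y = np.asarray(y, dtype=float)
    T = len(y)
    lam = sigma * np.sqrt(2 * np.log(T))  # universal threshold
    # Forward transform (pyramid algorithm).
    levels, approx = [], y
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        levels.append((even - odd) / np.sqrt(2))
        approx = (even + odd) / np.sqrt(2)
    # Hard-threshold the detail coefficients.
    levels = [np.where(np.abs(d) > lam, d, 0.0) for d in levels]
    # Inverse transform.
    for d in reversed(levels):
        even = (approx + d) / np.sqrt(2)
        odd = (approx - d) / np.sqrt(2)
        approx = np.empty(2 * len(d))
        approx[0::2], approx[1::2] = even, odd
    return approx
```

On piecewise constant data with small noise, the surviving coefficients are exactly those spanning true breakpoints, so the output collapses into a small number of constant blocks.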

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ŷ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides a priori samples in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1 Dynamically adaptive FBG for HMMs

 1: procedure HAMMLET(y, π_A, π_θ, π_q)
 2:     T ← |y|
 3:     λ ← √(2 ln T)
 4:     A ∼ P(A | π_A)
 5:     θ ∼ P(θ | π_θ)
 6:     for i = 1, …, N do
 7:         σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:         Create block sequence B from threshold λσ_min
 9:         q ∼ P(A, B, θ | q) P(q | π_q)
10:         A ∼ P(q | A) P(A | π_A)
11:         θ ∼ P(q, B | θ) P(θ | π_θ)
12:         Add count of marginal states for q to result
13:     end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: in this case, blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

$$\hat{\sigma}_{MAD}^2 = \left(\frac{\mathrm{MAD}_k\, d_{\log_2 T - 1,\, k}}{\Phi^{-1}\!\left(\frac{3}{4}\right)}\right)^{2},$$

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i ∈ θ}. If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
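The MAD-based estimate above can be sketched directly from the finest-level Haar detail coefficients; this illustrative version takes the median absolute coefficient (their median is near zero for pure noise) and is not the paper's exact code.

```python
import numpy as np

# Sketch: MAD-based noise variance estimate from the finest-level Haar
# detail coefficients, used above as a guard against over-compression.
def sigma_mad_sq(y):
    y = np.asarray(y, dtype=float)
    d_finest = (y[0::2] - y[1::2]) / np.sqrt(2)  # finest-scale Haar details
    mad = np.median(np.abs(d_finest))            # median absolute coefficient
    phi_inv_34 = 0.6744897501960817              # Φ^{-1}(3/4), standard normal
    return (mad / phi_inv_34) ** 2
```

Because the finest-level details difference out any piecewise constant signal, the estimate reflects only the noise component, not the copy-number structure.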

A data structure for dynamic compression

The necessity to recreate a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log₂ T and ℓ = h − j. We can map the wavelets ψ_{j,k} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ is increasing from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{j,k} = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{j,k}, with ℓ = h − j and t = k2^ℓ (ψ_{−1,0} is simply constant on the [0,1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size:

$$N_{\ell,t}[n] = 2^{\ell}.$$

• The sum of the data points stored in the subtree leaves:

$$N_{\ell,t}[\Sigma_1] = \sum_{i \in I_{j,k}} y_i.$$

• Similarly, the sum of squares:

$$N_{\ell,t}[\Sigma_2] = \sum_{i \in I_{j,k}} y_i^2.$$

• The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself:

$$N_{0,t}[d] = 0, \qquad N_{\ell>0,\,t}[d] = \max\bigl\{\, |d_{j',k'}| : \psi_{j',k'} \text{ in the subtree rooted at } N_{\ell,t} \,\bigr\}.$$

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect, binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
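The traversal with the i + 2n − 1 skip can be sketched on a flat DFS array. Here each node is a tuple (block size n, start position t, maximum absolute coefficient in the subtree); this layout and the function name are illustrative, not HaMMLET's actual representation.

```python
# Sketch of block creation on a flat DFS array of tree nodes. Pruning a
# subtree advances the index by 2n - 1 positions, so no pointers are needed.
def create_blocks(dfs_nodes, threshold):
    blocks, i = [], 0
    while i < len(dfs_nodes):
        n, t, max_coeff = dfs_nodes[i]
        if max_coeff <= threshold:
            blocks.append((t, n))  # emit block covering [t, t + n)
            i += 2 * n - 1         # skip the entire subtree
        else:
            i += 1                 # descend into the children
    return blocks

# Hypothetical DFS array for T = 4 with one strong breakpoint mid-data:
dfs = [(4, 0, 5.0), (2, 0, 0.1), (1, 0, 0.0), (1, 1, 0.0),
       (2, 2, 0.1), (1, 2, 0.0), (1, 3, 0.0)]
print(create_blocks(dfs, 1.0))
```

With a higher threshold the root itself passes and a single block of size T is emitted; with a lower one the traversal falls through to individual leaves, matching the subdivision behavior described above.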

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.

The wavelet tree data structure can be readily extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists, or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the


Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus, the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.


Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines on the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values, thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

following method to specify the hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse-Gamma distribution, $(\mu, \sigma^2) \sim \mathrm{NI}\Gamma(\mu_0, \nu, \alpha, \beta)$, whose marginals simply separate into a Normal and an Inverse Gamma distribution:

\[ \sigma^2 \sim \mathrm{I}\Gamma(\alpha, \beta), \qquad \mu \sim \mathrm{N}\!\left(\mu_0, \frac{\sigma^2}{\nu}\right). \]

Let $s^2$ be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or $\sigma^2_{\mathrm{MAD}}$), and $p$ the desired probability to sample a variance not larger than $s^2$. From the CDF of $\mathrm{I}\Gamma$ we obtain

\[ p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right). \]

$\mathrm{I}\Gamma$ has a mean for $\alpha > 1$, and closed-form solutions exist for $\alpha \in \mathbb{N}$. Furthermore, $\mathrm{I}\Gamma$ has positive skewness for $\alpha > 3$. We thus let $\alpha = 2$, which yields

\[ \beta = -s^2 \left( W_{-1}\!\left(-\frac{p}{e}\right) + 1 \right), \qquad 0 < p \le 1, \]

where $W_{-1}$ is the negative branch of the Lambert $W$-function, which is transcendental. However, an excellent analytical approximation, with a maximum error of 0.025%, is given in [87], which yields

\[ \beta \approx s^2 \left( \frac{2\sqrt{b}}{M_1 \sqrt{b} + \sqrt{2}\left( M_2\, b\, e^{M_3 \sqrt{b}} + 1 \right)} + b \right), \qquad b = -\ln p, \]

\[ M_1 = 0.3361, \quad M_2 = -0.0042, \quad M_3 = -0.0201. \]
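This closed form is straightforward to implement. A sketch (function names are ours), together with a check that the resulting $\beta$ indeed satisfies $P(\sigma^2 \le s^2) \approx p$, using $Q(2, x) = (1 + x)e^{-x}$:

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201

def beta_hyperparameter(p, s2):
    """Approximate beta for an Inverse-Gamma(2, beta) prior such that
    P(sigma^2 <= s2) = p, via the Lambert-W approximation of [87]."""
    b = -math.log(p)
    denom = M1 * math.sqrt(b) + math.sqrt(2) * (M2 * b * math.exp(M3 * math.sqrt(b)) + 1)
    return s2 * (2 * math.sqrt(b) / denom + b)

def prior_mass_below(s2, beta):
    """P(sigma^2 <= s2) for alpha = 2: Q(2, x) = (1 + x) * exp(-x), x = beta/s2."""
    x = beta / s2
    return (1 + x) * math.exp(-x)
```

For the automatic prior $P(\sigma^2 \le 0.01) = 0.9$ used in the experiments, this recovers the target probability to well within the stated approximation error.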

Since the mean of $\mathrm{I}\Gamma$ is $\frac{\beta}{\alpha - 1}$, the expected variance of $\mu$ is $\frac{\beta}{\nu}$ for $\alpha = 2$. To ensure proper mixing, we could simply set $\frac{\beta}{\nu}$ to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that the data contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from $\mu_0$ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define $\delta$ to be the sample variance of block means in the compression obtained by $\sigma^2_{\mathrm{MAD}}$, and take the maximum of those two variances. We thus obtain

\[ \mu_0 = \frac{\Sigma_1}{n} \quad \text{and} \quad \nu = \beta \max\!\left( \frac{n\Sigma_2 - \Sigma_1^2}{n^2},\; \delta \right)^{-1}. \]

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

\[ \frac{A_{jj}^{n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right). \]

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much stabler logarithm function:

\[ \exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1) \log A_{jj} - n_w \log \sigma_j \right). \]

Using the block's sufficient statistics

\[ n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2, \]

the exponent can be rewritten as

\[ E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j), \]

\[ K(n_w, j) = (n_w - 1) \log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right). \]

Note that $K(n_w, j)$ can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence $E_w(j) \le 0$. This is equivalent to dividing the forward variables by

\[ \exp\!\left( \max_k E_w(k) \right), \]

which cancels out during normalization. Hence we obtain

\[ \hat{\alpha}_w(j) = \exp\!\left( E_w(j) - \max_k E_w(k) \right) \sum_i \alpha_{w-1}(i)\, A_{ij}, \]

which are then normalized to

\[ \alpha_w(j) = \frac{\hat{\alpha}_w(j)}{\sum_k \hat{\alpha}_w(k)}. \]

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland WS: Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A New Smoothing Model for Analyzing Array CGH Data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9 Suppl 2, S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-C, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-C: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics (1966). doi:10.1214/aoms/1177699147. http://www.jstor.org/stable/2238772. Accessed 2015-02-14

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5


likely to yield a favorable comparison, and speedups of several orders of magnitude, especially on large data. For HaMMLET, we use a 5-state model with automatic hyperparameters P(σ² ≤ 0.01) = 0.9 (see section Automatic priors), and all Dirichlet hyperparameters set to 1.

We report F-measures for detecting aberrant segments (Fig. 2). On datasets T and BD&M, both methods have similar medians, but HaMMLET has a much better interquartile range (IQR) and range, about half of CBS's. On the spatial resolution data set (SRS), HaMMLET performs much better on very small aberrations. This might seem somewhat surprising, as short segments could easily get lost under compression. However, Lai et al. [65] have noted that smoothing-based methods such as quantile smoothing (quantreg) [19], lowess [20] and wavelet smoothing [25] perform particularly well in the presence of high noise and small CN aberrations, suggesting that "an optimal combination of the smoothing step and the segmentation step may result in improved performance". Our wavelet-based compression inherits those properties. For CNVs of sizes between 5 and 10, CBS and HaMMLET have similar ranges, with CBS being skewed towards better values. CBS has a slightly higher median for 10–20, with IQR and range being about the same. However, while HaMMLET's F-measure consistently approaches 1 for larger aberrations, CBS does not appear to significantly improve after size 10.

High-density CGH array

In this section, we demonstrate HaMMLET's performance on biological data. Due to the lack of a gold standard for high-resolution platforms, we assess the CNV calls qualitatively. We use raw aCGH data (GEO: GSE23949) [68] of genomic DNA from breast cancer cell line BT-474 (invasive ductal carcinoma, GEO: GSM590105) on an Agilent-021529 Human CGH Whole Genome Microarray 1x1M platform (GEO: GPL8736). We excluded gonosomes, mitochondrial and random chromosomes from the data, leaving 966,432 probes in total.

HaMMLET allows for using automatic emission priors (see section Automatic priors) by specifying a noise variance and a probability to sample a variance not exceeding this value. We compare HaMMLET's performance against CBS, using a 20-state model with automatic priors P(σ² ≤ 0.1) = 0.8, 10 prior self-transitions, and 1 for all other hyperparameters. CBS took over 2 h 9 min to process the entire array, whereas HaMMLET took 27.1 s for 100 iterations, a speedup of 288. The compression ratio (see section Speed and convergence effects of wavelet compression) was 220.3. CBS yielded a massive over-segmentation into 1,548 different copy number levels. As the data is derived from a relatively homogeneous cell line, as opposed to a biopsy, we do not expect the presence of subclonal populations to be a contributing factor [69, 70]. Instead, measurements on aCGH are known to be spatially correlated, resulting in a wave pattern which has to be removed in a preprocessing step. CBS performs such a smoothing, yet the unrealistically large number of different levels is likely due to residuals of said wave pattern. Repeated runs of CBS yielded different numbers of levels, suggesting that the merging was indeed incomplete. This further highlights the importance of prior knowledge in statistical inference: by limiting the number of CN states a priori, non-i.i.d. effects are easily remedied, and inference can be performed without preprocessing. Furthermore, the internal compression mechanism of HaMMLET is derived from a spatially adaptive regression method, so smoothing is inherent to our method. The results are included in the supplement.

For a more comprehensive analysis, we restricted our evaluation to chromosome 20 (21,687 probes), which we assessed to be the most complicated to infer, as it appears to have the highest number of different CN states and breakpoints. CBS yields a 19-state result after 157.8 s (Fig. 3, top). We have then used a 19-state model with automatic priors (P(σ² ≤ 0.04) = 0.9, 10 prior self-transitions, all other Dirichlet parameters set to 1) to reproduce this result. Using noise control (see Methods), our method took 16.1 s for 600 iterations. The solution we obtained is consistent with CBS (Fig. 3, middle and bottom). However, only 11 states were part of the final solution, i.e. 8 states yielded no significant likelihood above that of other states. We observe superfluous states being ignored in our simulations as well, cf. Supplement. In light of the results on the entire array, we suggest that the segmentation by DNAcopy has not been sufficiently merged by MergeLevels. Most strikingly, HaMMLET does not show any marginal support for a segment called by CBS around probe number 4,500. We have confirmed that this is not due to data compression, as the segment is broken up into multiple blocks in each iteration (data not shown). On the other hand, two much smaller segments called by CBS in the 17,000–20,000 range do have marginal support of about 40% in HaMMLET, suggesting that the lack of support for the larger segment is correct. It should be noted that inference differs between the entire array and chromosome 20 in both methods, since long-range effects have higher impact in larger data.

We also demonstrate another feature of HaMMLET called noise control. While Gaussian emissions have been deemed a sufficiently accurate noise model for aCGH [67], microarray data is prone to outliers, for example due to damages on the chip. While it is possible to model outliers directly [60], the characteristics of the wavelet transform allow us to largely suppress them during the construction of our data structure (see Methods). Notice that, due to noise control, most outliers are correctly labeled according to the segment they occur in, while the short gain segment close to the beginning is called correctly.


Figure 3: Copy number inference for chromosome 20 in invasive ductal carcinoma (21,687 probes). CBS creates a 19-state solution (top); however, a compressed 19-state HMM only supports an 11-state solution (bottom), suggesting insufficient level merging in CBS.



Speed and convergence effects of wavelet compression

The speedup gained by compression depends on how well the data can be compressed. Poor compression is expected when the means are not well separated, or when short segments have small variance, which necessitates the creation of smaller blocks for the rest of the data to expose potential low-variance segments to the sampler. On the other hand, the data must not be over-compressed, to avoid merging small aberrations with normal segments, which would decrease the F-measure. The level of compression is measured by the compression ratio. Due to the dynamic changes to the block structure, we define the compression ratio as the product of the number of data points T and the number of iterations N, divided by the total number of blocks over all iterations. As usual, a compression ratio of one indicates no compression.
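The definition above can be stated in a few lines of code. This is a minimal illustration of the formula, not part of HaMMLET; the per-iteration block counts are made up:

```python
def compression_ratio(block_counts, T):
    """Compression ratio: (T * N) / total number of blocks over all
    N iterations; block_counts[i] is the block count in iteration i."""
    N = len(block_counts)
    return (T * N) / sum(block_counts)

# A ratio of 1 means no compression (T blocks in every iteration):
assert compression_ratio([32768] * 10, T=32768) == 1.0
# 16 blocks per iteration on 32768 points gives a ratio of 2048:
assert compression_ratio([16] * 100, T=32768) == 2048.0
```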

To evaluate the impact of dynamic wavelet compression on the speed and convergence properties of an HMM, we created 129,600 different data sets with T = 32,768 probes each. In each data set, we randomly distributed 1 to 6 gains of a total length of 100, 250, 500, 750 or 1,000 uniformly among the data, and did the same for losses. Mean combinations (μ_loss, μ_neutral, μ_gain) were chosen from (log₂ ½, log₂ 1, log₂ 3/2), (−1, 0, 1), (−2, 0, 2) and (−10, 0, 10), and variances (σ²_loss, σ²_neutral, σ²_gain) from (0.05, 0.05, 0.05), (0.5, 0.1, 0.9), (0.3, 0.2, 0.1), (0.2, 0.1, 0.3), (0.1, 0.3, 0.2) and (0.1, 0.1, 0.1). These values were selected to yield a wide range of easy and hard cases: both well-separated, low-variance data with large aberrant segments, as well as cases in which small aberrations overlap significantly with the tail samples of high-variance neutral segments. Consequently, compression ratios range from ∼1 to ∼2,100. We use automatic priors P(σ² ≤ 0.2) = 0.9, self-transition priors α_ii ∈ {10, 100, 1000}, non-self transition priors α_ij = 1, and initial state priors α ∈ {1, 10}. Using all possible combinations of the above yields 129,600 different simulated data sets, a total of 4.2 billion values.

We achieve speedups per iteration of up to 350 compared to an uncompressed HMM (Fig 5). In contrast, [62] have reported ratios of 10–60, with one instance of 90. Notice that the speedup is not linear in the compression ratio: while sampling itself is expected to yield linear speedup, the marginal counts still have to be tallied individually for each position, and dynamic block creation causes some overhead. The quantization artifacts observed for larger speedups are likely due to the limited resolution of the Linux time command (10 milliseconds). Compressed HaMMLET took about 11.2 CPU hours for all 129,600 simulations, whereas the uncompressed version took over 3 weeks and 5 days. All running times reported are CPU time, measured on a single core of an AMD Opteron 6174 processor clocked at 2.2 GHz.

Figure 5: HaMMLET's speedup as a function of the average compression during sampling. As expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact that some overhead is incurred by the dynamic compression, as well as by parts of the implementation that do not depend on the compression, such as tallying the marginal counts.

We evaluate the convergence of the F-measure of compressed and uncompressed inference for each simulation. Since we are dealing with multi-class classification, we use the micro- and macro-averaged F-measures (F_mi, F_ma) proposed by [71]; they tend to be dominated by the classifier's performance on common and rare categories, respectively. Since all state labels are sampled from the same prior, and hence their relative order is random, we used the label permutation which yielded the highest sum of micro- and macro-averaged F-measures.
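The evaluation procedure above can be sketched as follows. This is our own minimal implementation of the metric, not the paper's code; for classes absent from both truth and prediction we define F1 := 1 by convention, and the exhaustive permutation search is only feasible for small state counts:

```python
from itertools import permutations

def f_measures(true, pred, n_states):
    # Per-class true/false positives and false negatives for
    # single-label multi-class predictions.
    tp = [0] * n_states; fp = [0] * n_states; fn = [0] * n_states
    for t, p in zip(true, pred):
        if t == p: tp[t] += 1
        else: fp[p] += 1; fn[t] += 1
    # Micro-average: pool counts over all classes.
    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    f_mi = 2 * TP / (2 * TP + FP + FN)
    # Macro-average: unweighted mean of per-class F1 scores.
    f1 = [2 * tp[s] / (2 * tp[s] + fp[s] + fn[s])
          if tp[s] + fp[s] + fn[s] else 1.0 for s in range(n_states)]
    return f_mi, sum(f1) / n_states

def best_label_permutation(true, pred, n_states):
    # State labels are exchangeable under the prior, so score the
    # relabeling that maximizes F_mi + F_ma (O(n_states!) search).
    return max((f_measures(true, [perm[p] for p in pred], n_states)
                for perm in permutations(range(n_states))),
               key=lambda f: f[0] + f[1])

# Swapped labels are recognized as a perfect segmentation:
assert best_label_permutation([0, 0, 1, 1, 2], [1, 1, 0, 0, 2], 3) == (1.0, 1.0)
```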

In Fig 4, we show that the compressed version of the Gibbs sampler converges almost instantly, whereas the uncompressed version converges much more slowly, with about 5% of the cases failing to yield an F-measure > 0.6 within 1000 iterations. Wavelet compression is likely to yield reasonably large blocks for the majority class early on, which leads to a strong posterior estimate of its parameters and self-transition probabilities. As expected, F_ma values are generally worse, since any misclassification in a rare class has a larger impact. Especially in the uncompressed version, we observe that F_ma tends to plateau until F_mi approaches 1.0. Since any misclassification in the majority (neutral) class adds false positives to the minority classes, this effect is expected. It implies that correct labeling of the majority class is a necessary condition for correct labeling of the minority classes; in other words, correct identification of the rare, interesting segments requires the sampler to properly converge, which is much harder to achieve without compression. It should be noted that running compressed HaMMLET for 1000 iterations is unnecessary on the simulated data, as in all cases it converges between 25 and 50 iterations. Thus, for all practical purposes, an additional speedup of 40–80 can be achieved by reducing the number of iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG.




Figure 4: The median value (black) and quantile ranges (in 5% steps) of the micro- (top) and macro-averaged (bottom) F-measures for uncompressed (left) and compressed (right) FBG inference on the same 129,600 simulated data sets, using automatic priors. The x-axis represents the number of iterations alone and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges no later than 50 iterations (inset figures, right).



Coriell, ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples of the Coriell cell lines. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard for evaluating CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated and typically not used in software evaluations. In Fig 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample from a 54-year-old female. We used 6-state automatic priors with P(σ² ≤ 0.1) = 0.85 and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1000 iterations; running times for the other samples were similar. Our results for all 25 data sets are included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use (using the software, setting software and method parameters), and often the speed of a method. Bayesian methods have acquired a reputation for requiring enormous computational effort and for being difficult to use, given the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of the uncertainty in results. Nevertheless, it is not surprising that a computational effort a hundred times larger has, on its own, prevented widespread use.

Inferring copy number variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g. long segments lost on both chromosomal copies). With lower-quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate and automated methods is evident.

The core idea of our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method are Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation and allow summarizing the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM, by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running time: up to 350-fold compared to the standard FB-Gibbs sampler, and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus, the combination of wavelets and HMM realizes a simultaneous improvement in accuracy and speed; typically, one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of those "easy"-to-call segments, given the current sample of HMM parameters, and identify the "hard"-to-call CN segments, which need the full computational effort of FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM, and changes drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements in accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig 1. Compared to our prior work, we observe a speedup of up to 3,000, due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics, and, lastly, a C++ instead of a Python implementation.

We performed an extensive comparison with the state-of-the-art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and on 129,600 simulated data sets; the latter were produced by a simulation process not based on an HMM, to make the problem harder for our method. All comparisons demonstrated favorable accuracy for our method, at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather with an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on the self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior, and thus the method, is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them in the evaluation, as they might be perceived as unfair to other methods.

In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate, and much more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for principled re-analysis of large cohorts, where the added accuracy and the much improved information about the uncertainty in copy number calls, from posterior marginal distributions, will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Figure 6: HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models, and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y), with transition matrix A, a latent state sequence q = (q₀, q₁, …, q_{T−1}), an observed emission sequence y = (y₀, y₁, …, y_{T−1}), and the emission parameters θ. The initial distribution of q₀ is often included and denoted as π, but in a Bayesian setting it makes more sense to treat π as a set of hyperparameters for q₀.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters,

  (A_ML, θ_ML) = argmax_{(A, θ)} L(A, θ | y),

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,

  q* = argmax_q P(q | A_ML, θ_ML, y),

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, if there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield an incorrect segmentation [62]. Furthermore, alternative segmentations given those parameters are ignored, as are those for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly, by integrating out the emission and transition variables:

  P(q | y) = ∫_A ∫_θ P(q, A, θ | y) dθ dA.

Since this integral is intractable, it has to be approximated using Markov chain Monte Carlo techniques, i.e. by drawing N samples

  (q⁽ⁱ⁾, A⁽ⁱ⁾, θ⁽ⁱ⁾) ∼ P(q, A, θ | y),

and subsequently approximating the marginal state probabilities by their frequency in the sample:

  P(q_t = s | y) ≈ (1/N) Σ_{i=1}^{N} I(q_t⁽ⁱ⁾ = s).

Thus, for each position t, we obtain a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

  A ∼ P(A | q) ∝ P(q | A) P(A | π_A),
  θ ∼ P(θ | q, y) ∝ P(q, y | θ) P(θ | π_θ), and
  q ∼ P(q | A, y, θ) ∝ P(A, y, θ | q) P(q | π_q),

where the π's represent the hyperparameters of the prior distributions. Notice that q depends on a prior only through q₀, so π_q is a set of parameters from which the distribution of q₀ is sampled (typically, π_q are parameters of a Dirichlet distribution); in non-Bayesian settings, π often denotes P(q₀) itself. Typically, each prior will be conjugate, i.e. it will be of the same class of distributions as the posterior. Thus π_S and π_A(k), the hyperparameters of q₀ and of the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ₀) will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for the segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG, by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B = (B_w)_{w=1}^{W} be a partition of y into W blocks, where each block B_w contains n_w elements, and let y_{w,k} denote the k-th element of B_w.



The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transition into state j, and the n_w − 1 self-transitions, which yields

  α_w(j) = A_{jj}^{n_w − 1} P(B_w | μ_j, σ_j²) Σ_i α_{w−1}(i) A_{ij},  with

  P(B_w | μ, σ²) = ∏_{k=1}^{n_w} P(y_{w,k} | μ, σ²).

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for the recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can easily be obtained.
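The key computational point is that P(B_w | μ, σ²), a product of Gaussian densities, can be evaluated in O(1) from the block's sufficient statistics (n_w, Σy, Σy²) alone, rather than in O(n_w). A minimal sketch of this evaluation (our illustration, not HaMMLET's C++ code):

```python
import math

def block_loglik(n, s1, s2, mu, var):
    """log prod_k N(y_k | mu, var), computed only from the sufficient
    statistics n = |B_w|, s1 = sum(y), s2 = sum(y^2):
    sum_k (y_k - mu)^2 = s2 - 2*mu*s1 + n*mu^2."""
    return (-0.5 * n * math.log(2 * math.pi * var)
            - (s2 - 2 * mu * s1 + n * mu * mu) / (2 * var))

# Agrees with the pointwise sum of log-densities:
ys = [0.1, -0.2, 0.05]
n, s1, s2 = len(ys), sum(ys), sum(y * y for y in ys)
direct = sum(-0.5 * math.log(2 * math.pi) - y * y / 2 for y in ys)
assert abs(block_loglik(n, s1, s2, 0.0, 1.0) - direct) < 1e-12
```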

Wavelet theory preliminaries

Here we review some essential wavelet theory; for details, see [74, 75]. Let

  ψ(x) = 1 for 0 ≤ x < 1/2,  −1 for 1/2 ≤ x < 1,  0 elsewhere,

be the Haar wavelet [76], and let ψ_{j,k}(x) := 2^{j/2} ψ(2^j x − k); j and k are called the scale and shift parameters. Any square-integrable function over the unit interval, f ∈ L²([0,1)), can be approximated using the orthonormal basis {ψ_{j,k} | j, k ∈ ℤ, −1 ≤ j, 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x) which is spatially localized, i.e. non-zero only on a finite interval in x. The Haar basis is particularly suited to expressing piecewise constant functions.

Finite data y = (y₀, …, y_{T−1}) can be treated as an equidistant sample of f(x) by scaling the indices to the unit interval, using x_t := t/T. Let h := log₂ T. Then y can be expressed exactly as a linear combination over the above Haar wavelet basis, restricted to the maximum level of sampling resolution (j ≤ h − 1):

  y_t = Σ_{j,k} d_{j,k} ψ_{j,k}(x_t).

The wavelet transform d = W y is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [77].
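As a rough illustration (a sketch under the standard orthonormal Haar convention, not the paper's implementation), the pyramid algorithm repeatedly averages and differences adjacent pairs, so the total work is T/2 + T/4 + … + 1 = T − 1 operations:

```python
import math

def haar_pyramid(y):
    """O(T) Haar cascade: returns the detail coefficients per level
    (finest first) and the overall smooth coefficient."""
    assert len(y) & (len(y) - 1) == 0, "T must be a power of 2"
    r2 = math.sqrt(2.0)
    s, details = list(y), []
    while len(s) > 1:
        # difference -> detail coefficients, sum -> next smooth level
        details.append([(s[2*i] - s[2*i+1]) / r2 for i in range(len(s) // 2)])
        s = [(s[2*i] + s[2*i+1]) / r2 for i in range(len(s) // 2)]
    return details, s[0]

# Piecewise constant data: details vanish except across the jump.
details, smooth = haar_pyramid([1.0, 1.0, 5.0, 5.0])
assert details[0] == [0.0, 0.0]       # finest scale: constant pairs
assert abs(details[1][0] + 4.0) < 1e-12  # the jump shows up once
assert abs(smooth - 6.0) < 1e-12
```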

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̃ be a vector obtained by setting elements of d to zero; then ỹ = W^⊤ d̃ is called a selective wavelet reconstruction (SW). If all coefficients d_{j,k} with ψ_{j,k}(x_t) ≠ ψ_{j,k}(x_{t+1}) are set to zero for some t, then ỹ_t = ỹ_{t+1}, which implies a block structure on ỹ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that ỹ does not have to be computed explicitly; the block boundaries can be inferred from the positions of the zero-entries in d̃ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem of finding an estimate of a piecewise constant function μ whose range is precisely the set of emission means, i.e.

  μ = f(x),  y = f(x) + ε,  ε ∼iid N(0, σ²).

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ŷ ≠ μ. However, if ŷ is piecewise constant and minimizes ‖μ − ŷ‖, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding states, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints. ŷ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW can, in theory, be used as an almost ideal spatially adaptive regression method. Assuming one could provide an oracle Δ(μ, y) that knows the true μ, there exists a method M_SW(y, Δ), using an optimal subset of wavelet coefficients provided by Δ, such that the quadratic risk of its estimate ŷ_SW is bounded as

  ‖μ − ŷ_SW‖₂² = O(σ² ln T / T).

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they showed that for a method M_WCT(y, λσ), called wavelet coefficient thresholding, which sets to zero all coefficients whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T), with estimate ŷ_WCT, such that

  ‖μ − ŷ_WCT‖₂² ≤ (2 ln T + 1) (‖μ − ŷ_SW‖₂² + σ²/T).

This λ_T is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².
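A compact sketch of universal-threshold Haar shrinkage, combining a forward pyramid transform, thresholding at λ = σ√(2 ln T), and the inverse transform. The function name and structure are our own illustration under standard orthonormal Haar conventions, not HaMMLET's code (which, as described below, never reconstructs explicitly):

```python
import math

def haar_shrink(y, sigma):
    """Zero all Haar detail coefficients with |d| <= sigma*sqrt(2 ln T),
    then reconstruct; the result is piecewise constant on blocks."""
    T = len(y)
    assert T & (T - 1) == 0, "T must be a power of 2"
    lam = sigma * math.sqrt(2.0 * math.log(T))
    r2 = math.sqrt(2.0)
    # forward pyramid transform, thresholding each detail level
    s, levels = list(y), []
    while len(s) > 1:
        d = [(s[2*i] - s[2*i+1]) / r2 for i in range(len(s) // 2)]
        s = [(s[2*i] + s[2*i+1]) / r2 for i in range(len(s) // 2)]
        levels.append([c if abs(c) > lam else 0.0 for c in d])
    # inverse transform: a = (s+d)/sqrt(2), b = (s-d)/sqrt(2)
    for d in reversed(levels):
        s = [v for sk, dk in zip(s, d)
             for v in ((sk + dk) / r2, (sk - dk) / r2)]
    return s

# A clear jump survives thresholding; flat parts stay merged:
out = haar_shrink([0.0] * 8 + [4.0] * 8, sigma=0.1)
assert len(set(round(v, 6) for v in out)) == 2   # exactly two blocks
```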

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ỹ will approximately preserve the breakpoints around those low-variance segments. Segments of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances of the different states are not known, FBG provides samples of them in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

 1: procedure HaMMLET(y, π_A, π_θ, π_q)
 2:   T ← |y|
 3:   λ ← √(2 ln T)
 4:   A ∼ P(A | π_A)
 5:   θ ∼ P(θ | π_θ)
 6:   for i = 1, …, N do
 7:     σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:     create block sequence B from threshold λσ_min
 9:     q ∼ P(A, B, θ | q) P(q | π_q)
10:     A ∼ P(q | A) P(A | π_A)
11:     θ ∼ P(q, B | θ) P(θ | π_θ)
12:     add counts of marginal states for q to result
13:   end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically for the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: in this case, blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method for estimating the variance of constant noise in shrinkage applications,

  σ²_MAD = ( MAD_k( d_{log₂ T − 1, k} ) / Φ⁻¹(3/4) )²,

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i | σ_i² ∈ θ}. If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]; in this case, the blocks become smaller than necessary, thus decreasing the compression.
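The estimator above is the usual robust noise estimate from the finest-scale detail coefficients. A minimal sketch (our own helper, not the paper's code):

```python
import statistics

PHI_INV_075 = 0.6744897501960817   # Phi^{-1}(3/4), standard normal quantile

def sigma_mad_squared(finest_details):
    """Robust noise-variance estimate from the finest-scale Haar
    coefficients d_{log2(T)-1, k}: (MAD / Phi^{-1}(3/4))^2."""
    med = statistics.median(finest_details)
    mad = statistics.median(abs(d - med) for d in finest_details)
    return (mad / PHI_INV_075) ** 2

# MAD of [-1, -0.5, 0, 0.5, 1] is 0.5:
assert abs(sigma_mad_squared([-1.0, -0.5, 0.0, 0.5, 1.0])
           - (0.5 / PHI_INV_075) ** 2) < 1e-12
```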

A data structure for dynamic compression

The necessity of recreating a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or otherwise considering all T elements. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h := log₂ T and ℓ := h − j. We can map the wavelets ψ_{j,k} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ increases from the leaves to the root (Fig 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{j,k} = [kn, (k+1)n − 1], stored in the leaves below; notice that kn = t for the leaves.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in that subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in its entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to the wavelet ψ_{j,k}, with ℓ = h − j and t = k2^ℓ (ψ_{−1,0} is simply constant on the [0,1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size:

  N_{ℓ,t}[n] = 2^ℓ.

• The sum of the data points stored in the subtree's leaves:

  N_{ℓ,t}[Σ₁] = Σ_{i ∈ I_{j,k}} y_i.

• Similarly, the sum of squares:

  N_{ℓ,t}[Σ₂] = Σ_{i ∈ I_{j,k}} y_i².

• The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself:

  N_{0,t}[d] = 0,  N_{ℓ>0,t}[d] = max_{ℓ′ ≤ ℓ, t ≤ t′ < t+2^ℓ} | d_{h−ℓ′, t′/2^{ℓ′}} |.

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple, flat array. An example of dynamic block creation is given in Fig 8.
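The flat-array traversal with i → i + 2n − 1 pruning jumps can be sketched as follows. This is a simplified Python illustration, not HaMMLET's C++ implementation; the recursive construction and the `details` layout (one list of coefficients per level, finest first) are our own assumptions:

```python
import math

def build_dfs_tree(y, details):
    """Flat DFS (preorder) array of nodes (n, sum, sumsq, max|coef|);
    details[l][k] is the coefficient whose subtree has 2**(l+1) leaves."""
    def node(level, k):
        if level == 0:
            return [(1, y[k], y[k] * y[k], 0.0)]  # leaves carry no coefficient
        left, right = node(level - 1, 2 * k), node(level - 1, 2 * k + 1)
        m = max(abs(details[level - 1][k]), left[0][3], right[0][3])
        n, s1, s2 = 2 ** level, left[0][1] + right[0][1], left[0][2] + right[0][2]
        return [(n, s1, s2, m)] + left + right
    return node(int(math.log2(len(y))), 0)

def blocks(dfs, threshold):
    """DFS traversal with pruning: emitting a block of n leaves skips
    its 2n - 1 subtree nodes, so no pointers are needed."""
    out, i = [], 0
    while i < len(dfs):
        n, s1, s2, m = dfs[i]
        if m <= threshold:
            out.append((n, s1, s2))   # emit block, prune subtree
            i += 2 * n - 1
        else:
            i += 1                    # descend into children
    return out

# Two constant halves of [1]*4 + [5]*4: only the root coefficient is large.
details = [[0.0] * 4, [0.0] * 2, [-8 / math.sqrt(2)]]
dfs = build_dfs_tree([1.0] * 4 + [5.0] * 4, details)
assert blocks(dfs, 0.5) == [(4, 4.0, 4.0), (4, 20.0, 100.0)]
```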

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other, occurring in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check whether the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.
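This caching idea can be sketched as follows. The class and its interface are hypothetical helpers for illustration, not HaMMLET's API; it relies on the bijection above, i.e. an equal block count implies an identical structure:

```python
class BlockCache:
    """Remembers the last block structure and the variance interval
    known to produce it; a variance inside that interval is a cache hit."""
    def __init__(self, make_blocks):
        self.make_blocks = make_blocks   # maps a variance to a block list
        self.interval, self.last = None, None

    def get(self, var):
        if self.interval and self.interval[0] <= var <= self.interval[1]:
            return self.last             # same count => same structure
        result = self.make_blocks(var)
        if self.last is not None and len(result) == len(self.last):
            lo, hi = self.interval       # widen the registered interval
            self.interval = (min(lo, var), max(hi, var))
        else:
            self.interval = (var, var)
        self.last = result
        return result

# Two computations widen the interval; the third variance is a hit:
calls = []
def mb(v):
    calls.append(v)
    return [0] * (4 if v > 1 else 8)
cache = BlockCache(mb)
cache.get(2.0); cache.get(3.0); cache.get(2.5)
assert calls == [2.0, 3.0]
```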

The wavelet tree data structure can readily be extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that block creation can be done in O(T) instead of O(mT); i.e. the dimensionality of the data only enters into the creation of the data structure, not into the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in microarrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the


Figure 7 Mapping of wavelets ψ jk and data points yt to treenodes N`t Each node is the root of a subtree with n= 2` leavespruning that subtree yields a block of size n starting at positiont For instance the node N16 is located at position 13 of the DFSarray (solid line) and corresponds to the wavelet ψ33 A block ofsize n = 2 can be created by pruning the subtree which amountsto advancing by 2nminus 1= 3 positions (dashed line) yielding N38

at position 16 which is the wavelet ψ11 Thus the number ofsteps for creating blocks per iteration is at most the number ofnodes in the tree and thus strictly smaller than 2T
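The DFS-array traversal with pruning jumps can be sketched as follows (node layout and names are hypothetical; we assume each node stores its leaf count n and the largest absolute detail coefficient relevant to its subtree):

```python
def create_blocks(nodes, threshold):
    """nodes: wavelet-tree nodes in DFS order, each as (n, coeff) with n the
    subtree's leaf count and coeff the largest absolute detail coefficient
    relevant to that subtree. Returns the resulting block sizes."""
    blocks, i = [], 0
    while i < len(nodes):
        n, coeff = nodes[i]
        if coeff < threshold or n == 1:
            blocks.append(n)    # prune: one block covering n data points
            i += 2 * n - 1      # jump over the 2n - 1 nodes of this subtree
        else:
            i += 1              # descend: next DFS entry is the left child
    return blocks
```

Each node is visited at most once per iteration, matching the caption's bound of fewer than 2T steps.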

[Figure 8 omitted: top panel shows block sizes (1–256) checked in the wavelet tree; bottom panel shows the measurements against position along the chromosome (0–250).]

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

following method to specify the hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse-Gamma distribution, (μ, σ²) ~ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse-Gamma distribution:

σ² ~ IΓ(α, β),   μ ~ N(μ₀, σ²/ν).

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and let p be the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain

p = P(σ² ≤ s²) = Γ(α, β/s²) / Γ(α) = Q(α, β/s²).

IΓ has a mean for α > 1 and closed-form solutions for α ∈ ℕ; furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

β = −s² ( W₋₁(−p e⁻¹) + 1 ),   0 < p ≤ 1,

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation with a maximum error of 0.025% is given in [87]. With b := −ln p, this yields

β ≈ s² ( b + √(2b) / ( 1 + M1 √(b/2) + M2 b exp(M3 √b) ) ),

M1 = 0.3361,   M2 = −0.0042,   M3 = −0.0201.

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that the data contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

μ₀ = Σ₁/n   and   ν = β [ max( (nΣ₂ − Σ₁²)/n², δ ) ]⁻¹.
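To make the recipe concrete, the following sketch (our own code, not the HaMMLET source; function names are hypothetical) computes β from (p, s²) via the Lambert-W approximation of [87], and μ₀ and ν from toy sufficient statistics:

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201  # constants from Barry et al. [87]

def beta_from(p, s2):
    """beta of the IGamma(2, beta) prior such that P(sigma^2 <= s2) ~ p,
    using the analytical approximation of the Lambert W-function."""
    b = -math.log(p)
    frac = math.sqrt(2 * b) / (
        1 + M1 * math.sqrt(b / 2) + M2 * b * math.exp(M3 * math.sqrt(b)))
    return s2 * (b + frac)

def mu0_nu_from(beta, n, sigma1, sigma2, delta):
    """mu0 is the sample mean; nu is set so that the expected variance of mu,
    beta/nu, matches the larger of the sample variance and delta."""
    mu0 = sigma1 / n
    sample_var = (n * sigma2 - sigma1 ** 2) / n ** 2
    return mu0, beta / max(sample_var, delta)
```

For α = 2, the CDF condition reduces to (1 + β/s²) exp(−β/s²) = p, which provides a direct sanity check on the approximation.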

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within a block w of size n_w can be written as

( A_jj^{n_w − 1} / ( (2π)^{n_w/2} σ_j^{n_w} ) ) exp( − Σ_{k=1}^{n_w} (y_{w,k} − μ_j)² / (2σ_j²) ).

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much stabler logarithm function:

exp( − Σ_{k=1}^{n_w} (y_{w,k} − μ_j)² / (2σ_j²) + (n_w − 1) log A_jj − n_w log σ_j ).

Using the block's sufficient statistics

n_w,   Σ₁ = Σ_{k=1}^{n_w} y_{w,k},   Σ₂ = Σ_{k=1}^{n_w} y²_{w,k},

the exponent can be rewritten as

E_w(j) = (2μ_j Σ₁ − Σ₂) / (2σ_j²) + K(n_w, j),

K(n_w, j) = (n_w − 1) log A_jj − n_w ( log σ_j + μ_j² / (2σ_j²) ).

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

exp( max_k E_w(k) ),

which cancels out during normalization. Hence we obtain

α̂_w(j) = exp( E_w(j) − max_k E_w(k) ) Σ_i α_{w−1}(i) A_ij,

which are then normalized to

α_w(j) = α̂_w(j) / Σ_k α̂_w(k).
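The following toy sketch (our own code and naming, not the paper's implementation) puts these pieces together for a single linear-space forward pass over compressed blocks:

```python
import math

def forward_blocks(blocks, mu, sigma, A, pi):
    """Linear-space forward pass over blocks given as (n, S1, S2) sufficient
    statistics, with per-state means mu, standard deviations sigma,
    transition matrix A, and initial distribution pi."""
    S = len(mu)
    alpha = None
    result = []
    for n, S1, S2 in blocks:
        E = []
        for j in range(S):
            # K(n_w, j); in the real sampler this is precomputed per iteration.
            K = (n - 1) * math.log(A[j][j]) - n * (
                math.log(sigma[j]) + mu[j] ** 2 / (2 * sigma[j] ** 2))
            E.append((2 * mu[j] * S1 - S2) / (2 * sigma[j] ** 2) + K)
        m = max(E)  # subtracting the maximum keeps exp() from overflowing
        if alpha is None:
            a = [math.exp(E[j] - m) * pi[j] for j in range(S)]
        else:
            a = [math.exp(E[j] - m) * sum(alpha[i] * A[i][j] for i in range(S))
                 for j in range(S)]
        z = sum(a)
        alpha = [v / z for v in a]
        result.append(alpha)
    return result
```

The normalization constant z absorbs both the dropped (2π)^{n_w/2} factor and the exp(max_k E_w(k)) division, as described above.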

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authorsrsquo contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No 1, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland WS: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9 Suppl 2, S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, 1–1 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), e89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9. http://link.springer.com/10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147. http://www.jstor.org/stable/2238772. Accessed 2015-02-14

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), e41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), e431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), i450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for hidden Markov models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), i139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text categorization with class-based and corpus-based keyword selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104. http://epubs.siam.org/doi/book/10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of subtidal coastal sea level fluctuations using wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical properties and uses of the wavelet variance estimator for the scale analysis of time series. Journal of the American Statistical Association (2000). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5


Figure 3: Copy number inference for chromosome 20 in invasive ductal carcinoma (21,687 probes). CBS creates a 19-state solution (top); however, a compressed 19-state HMM only supports an 11-state solution (bottom), suggesting insufficient level merging in CBS.


Speed and convergence effects of wavelet compression

The speedup gained by compression depends on how well the data can be compressed. Poor compression is expected when the means are not well separated, or when short segments have small variance, which necessitates the creation of smaller blocks for the rest of the data to expose potential low-variance segments to the sampler. On the other hand, data must not be over-compressed, to avoid merging small aberrations with normal segments, which would decrease the F-measure. The level of compression is measured by the compression ratio. Due to the dynamic changes to the block structure, we define the compression ratio as the product of the number of data points T and the number of iterations N, divided by the total number of blocks over all iterations. As usual, a compression ratio of one indicates no compression.
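Under this definition (our own helper function; N is taken as the number of iterations):

```python
def compression_ratio(blocks_per_iteration, T):
    """T * N divided by the total number of blocks over all N iterations;
    1.0 means no compression (every data point is its own block)."""
    N = len(blocks_per_iteration)
    return T * N / sum(blocks_per_iteration)
```

For instance, the 19-block compression of the T = 256 example in Figure 8 gives 256/19 ≈ 13.5.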

To evaluate the impact of dynamic wavelet compression on the speed and convergence properties of an HMM, we created 129,600 different data sets with T = 32768 probes each. In each data set, we randomly distributed 1 to 6 gains of a total length of 100, 250, 500, 750 or 1000 uniformly among the data, and did the same for losses. Mean combinations (μ_loss, μ_neutral, μ_gain) were chosen from (log₂ 1/2, log₂ 1, log₂ 3/2), (−1, 0, 1), (−2, 0, 2) and (−10, 0, 10), and variances (σ²_loss, σ²_neutral, σ²_gain) from (0.05, 0.05, 0.05), (0.5, 0.1, 0.9), (0.3, 0.2, 0.1), (0.2, 0.1, 0.3), (0.1, 0.3, 0.2) and (0.1, 0.1, 0.1). These values have been selected to yield a wide range of easy and hard cases, both well separated, low-variance data with large aberrant segments, as well as cases in which small aberrations overlap significantly with the tail samples of high-variance neutral segments. Consequently, compression ratios range from ∼1 to ∼2100. We use automatic priors with P(σ² ≤ 0.2) = 0.9, self-transition priors α_ii ∈ {10, 100, 1000}, non-self transition priors α_ij = 1, and initial state priors α ∈ {1, 10}. Using all possible combinations of the above yields 129,600 different simulated data sets, a total of 4.2 billion values.

We achieve speedups per iteration of up to 350 compared to an uncompressed HMM (Fig. 5). In contrast, [62] have reported ratios of 10–60, with one instance of 90. Notice that the speedup is not linear in the compression ratio. While sampling itself is expected to yield a linear speedup, the marginal counts still have to be tallied individually for each position, and dynamic block creation causes some overhead. The quantization artifacts observed for larger speedups are likely due to the limited resolution of the Linux time command (10 milliseconds). Compressed HaMMLET took about 112 CPU hours for all 129,600 simulations, whereas the uncompressed version took over 3 weeks and 5 days. All running times reported are CPU time, measured on a single core of an AMD Opteron™ 6174 processor clocked at 2.2 GHz.

Figure 5: HaMMLET's speedup as a function of the average compression during sampling. As expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact that some overhead is incurred by the dynamic compression, as well as by parts of the implementation that do not depend on the compression, such as tallying marginal counts.

We evaluate the convergence of the F-measure of compressed and uncompressed inference for each simulation. Since we are dealing with multi-class classification, we use the micro- and macro-averaged F-measures (Fmi, Fma) proposed by [71]; they tend to be dominated by the classifier's performance on common and rare categories, respectively. Since all state labels are sampled from the same prior, and hence their relative order is random, we used the label permutation which yielded the highest sum of micro- and macro-averaged F-measures.
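For reference, micro- and macro-averaged F-measures over a single-label multi-class prediction can be computed as follows (a generic sketch, not the paper's evaluation code):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_macro_f(truth, pred, classes):
    """Return (micro-averaged F, macro-averaged F) over the given classes."""
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, y in zip(truth, pred):
        if t == y:
            tp[y] += 1
        else:
            fp[y] += 1
            fn[t] += 1
    # Micro: pool counts over classes, so common classes dominate.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(TP / (TP + FP), TP / (TP + FN))
    # Macro: average per-class F, so rare classes weigh equally.
    per_class = []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(f1(p, r))
    macro = sum(per_class) / len(per_class)
    return micro, macro
```

A single misclassified probe in a rare state barely moves the micro average but visibly lowers the macro average, which is the behavior described in the text.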

In Fig. 4, we show that the compressed version of the Gibbs sampler converges almost instantly, whereas the uncompressed version converges much more slowly, with about 5% of the cases failing to yield an F-measure > 0.6 within 1000 iterations. Wavelet compression is likely to yield reasonably large blocks for the majority class early on, which leads to a strong posterior estimate of its parameters and self-transition probabilities. As expected, Fma values are generally worse, since any misclassification in a rare class has a larger impact. Especially in the uncompressed version, we observe that Fma tends to plateau until Fmi approaches 1.0. Since any misclassification in the majority (neutral) class adds false positives to the minority classes, this effect is expected. It implies that correct labeling of the majority class is a necessary condition for correct labeling of minority classes; in other words, correct identification of the rare, interesting segments requires the sampler to properly converge, which is much harder to achieve without compression. It should be noted that running compressed HaMMLET for 1000 iterations is unnecessary on the simulated data, as in all cases it converges between 25 and 50 iterations. Thus, for all practical purposes, an additional speedup of 40–80 can be achieved by reducing the number of iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG.


[Figure 4 omitted: micro- and macro-averaged F-measure (0.0–1.0) versus iteration (0–1000) for uncompressed and compressed inference, with inset panels for iterations 0–25.]

Figure 4: The median value (black) and quantile ranges (in 5% steps) of the micro- (top) and macro-averaged (bottom) F-measures for uncompressed (left) and compressed (right) FBG inference on the same 129,600 simulated data sets, using automatic priors. The x-axis represents the number of iterations alone, and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges no later than 50 iterations (inset figures, right).



Coriell, ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples for the Coriell cell line. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated and typically not used in software evaluations. In Fig. 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample of a 54-year-old female. We used 6-state automatic priors with P(σ² ≤ 0.1) = 0.85 and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1000 iterations; running times for the other samples were similar. Our results for all 25 data sets have been included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use (using the software, setting software and method parameters), and often the speed of a method. Bayesian methods have acquired a reputation for requiring enormous computational effort and for being difficult to use, given the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of uncertainty in results. Nevertheless, it is not surprising that a hundred-fold larger computational effort alone has prevented widespread use.

Inferring copy number variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g. long segments lost on both chromosomal copies). With lesser-quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate, and automated methods is evident.

The core idea for our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method is Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation, and allow summarizing the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM, by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running times, up to 350-fold compared to the standard FB-Gibbs sampler, and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus the combination of wavelets and HMM realizes a simultaneous improvement in accuracy and in speed; typically, one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of those "easy"-to-call segments, given the current sample of HMM parameters, and identify "hard"-to-call CN segments, which need the full computational effort of FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements in accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig. 1. Compared to our prior work, we observe a speed-up of up to 3,000 due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics, and lastly a C++ instead of a Python implementation.

We performed an extensive comparison with the state of the art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and on 129,600 simulated data sets, which were produced by a simulation process not based on HMMs, to make it a harder problem for our method. All comparisons demonstrated favorable results for our method when measuring accuracy, at a very noticeable acceleration compared to the state of the art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather through an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior, and thus the method, is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them for the evaluation, as they might be perceived as unfair to other methods.

In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate, and much more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for principled re-analysis of large cohorts, where the added accuracy, and the much improved information about uncertainty in copy number calls from posterior marginal distributions, will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Figure 6: HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y), with transition matrix A, a latent state sequence q = (q_0, q_1, ..., q_{T-1}), an observed emission sequence y = (y_0, y_1, ..., y_{T-1}), and the emission parameters θ. The initial distribution of q_0 is often included and denoted as π, but in a Bayesian setting it makes more sense to take π as a set of hyperparameters for q_0.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters,

\[
(A^{ML}, \theta^{ML}) = \arg\max_{(A, \theta)} \mathcal{L}(A, \theta \mid y),
\]

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,

\[
q^{*} = \arg\max_{q} P(q \mid A^{ML}, \theta^{ML}, y),
\]

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield an incorrect segmentation [62]. Furthermore, alternative segmentations given those parameters are also ignored, as are the ones for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly by integrating out the emission and transition variables:

\[
P(q \mid y) = \int_{A} \int_{\theta} P(q, A, \theta \mid y)\, d\theta\, dA.
\]

Since this integral is intractable, it has to be approximated using Markov chain Monte Carlo techniques, i.e. drawing N samples

\[
(q^{(i)}, A^{(i)}, \theta^{(i)}) \sim P(q, A, \theta \mid y),
\]

and subsequently approximating marginal state probabilities by their frequency in the sample:

\[
P(q_t = s \mid y) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big(q_t^{(i)} = s\big).
\]

Thus, for each position t, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

\[
A \sim P(A \mid q) \propto P(q \mid A)\, P(A \mid \pi_A),
\]
\[
\theta \sim P(\theta \mid q, y) \propto P(q, y \mid \theta)\, P(\theta \mid \pi_\theta), \quad \text{and}
\]
\[
q \sim P(q \mid A, y, \theta) \propto P(A, y, \theta \mid q)\, P(q \mid \pi_q),
\]

where π represents hyperparameters of the prior distributions. Notice that q depends on a prior only in q_0, so π_q is a set of parameters from which the distribution of q_0 is sampled (typically, π_q are parameters of a Dirichlet distribution). In non-Bayesian settings, π often denotes P(q_0) itself. Typically, each prior will be conjugate, i.e. it will be of the same class of distributions as the posterior. Thus π_q and π_A(k), the hyperparameters of q_0 and of the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ_0) will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG, by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B = (B_w)_{w=1}^{W} be a partition of y into W blocks, where each block B_w contains n_w elements, and let y_{w,k} be the k-th element in B_w.



The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transitions into state j, and the n_w − 1 self-transitions, which yields

\[
\alpha_w(j) = A_{jj}^{n_w - 1}\, P(B_w \mid \mu_j, \sigma_j^2) \sum_{i} \alpha_{w-1}(i)\, A_{ij}, \quad \text{and}
\]

\[
P(B_w \mid \mu, \sigma^2) = \prod_{k=1}^{n_w} P(y_{w,k} \mid \mu, \sigma^2).
\]

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but would still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can be easily obtained.

Wavelet theory preliminaries

Here we review some essential wavelet theory; for details, see [74, 75]. Let

\[
\psi(x) =
\begin{cases}
1 & 0 \le x < \frac{1}{2} \\
-1 & \frac{1}{2} \le x < 1 \\
0 & \text{elsewhere}
\end{cases}
\]

be the Haar wavelet [76], and ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k); j and k are called the scale and shift parameters. Any square-integrable function over the unit interval, f ∈ L²([0,1)), can be approximated using the orthonormal basis {ψ_{j,k} | j, k ∈ ℤ; −1 ≤ j; 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x), which is spatially localized, i.e. non-zero on only a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data y = (y_0, ..., y_{T-1}) can be treated as an equidistant sample of f(x) by scaling the indices to the unit interval using x_t = t/T. Let h = log₂ T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j ≤ h − 1):

\[
y_t = \sum_{j,k} d_{j,k}\, \psi_{j,k}(x_t).
\]

The wavelet transform d = W y is an orthogonal endomorphism and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [77].

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̃ be a vector obtained by setting elements of d to zero; then ỹ = Wᵀd̃ is called a selective wavelet reconstruction (SW). If all coefficients d_{j,k} with ψ_{j,k}(x_t) ≠ ψ_{j,k}(x_{t+1}) are set to zero for some t, then ỹ_t = ỹ_{t+1}, which implies a block structure on ỹ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that ỹ does not have to be computed explicitly; the block boundaries can be inferred from the positions of zero-entries in d̃ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem of finding an estimate μ̂ of a piecewise constant function μ whose range is precisely the set of emission means, i.e.

\[
\mu = f(x), \qquad y = f(x) + \epsilon, \qquad \epsilon \overset{\text{iid}}{\sim} N(0, \sigma^2).
\]

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ŷ ≠ μ. However, if ŷ is piecewise constant and minimizes ‖μ − ŷ‖, the sample means of each block are close to the true emission means. This yields a high likelihood for their corresponding state and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints. ŷ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle Δ(μ, y) that knew the true μ, then there exists a method M_SW(y, Δ) = Wᵀ_SW, using an optimal subset of wavelet coefficients provided by Δ, such that the quadratic risk of ŷ_SW = Wᵀ_SW d is bounded as

\[
\|\mu - \hat{y}_{SW}\|_2^2 = O\!\left( \frac{\sigma^2 \ln T}{T} \right).
\]

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M_WCT(y, λ, σ), called wavelet coefficient thresholding, which sets to zero those coefficients whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T), with ŷ_WCT = Wᵀ_WCT d, such that

\[
\|\mu - \hat{y}_{WCT}\|_2^2 \le (2 \ln T + 1) \left( \|\mu - \hat{y}_{SW}\|_2^2 + \frac{\sigma^2}{T} \right).
\]

This λ_T is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².
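Putting the transform and the universal threshold together, compression for a known noise level can be sketched as follows (a Python sketch assuming T is a power of 2 and σ is given; `haar`, `ihaar` and `compress` are our names, and a production implementation would infer the blocks without reconstructing ỹ, as described below):

```python
import math

def haar(y):
    """Forward orthonormal Haar transform (pyramid algorithm), coarse to fine."""
    a, details = list(y), []
    while len(a) > 1:
        d = [(a[2 * i] - a[2 * i + 1]) / math.sqrt(2) for i in range(len(a) // 2)]
        a = [(a[2 * i] + a[2 * i + 1]) / math.sqrt(2) for i in range(len(a) // 2)]
        details = d + details
    return a + details

def ihaar(d):
    """Inverse of haar()."""
    a, pos = [d[0]], 1
    while pos < len(d):
        det = d[pos:pos + len(a)]
        a = [v for s, w in zip(a, det)
             for v in ((s + w) / math.sqrt(2), (s - w) / math.sqrt(2))]
        pos += len(det)
    return a

def compress(y, sigma):
    """Zero all detail coefficients at or below the universal threshold
    sqrt(2 ln T) * sigma, reconstruct, and return the resulting blocks
    as (start, length, mean) tuples."""
    T = len(y)
    lam = math.sqrt(2.0 * math.log(T)) * sigma
    d = haar(y)
    d = [d[0]] + [c if abs(c) > lam else 0.0 for c in d[1:]]
    r = ihaar(d)
    blocks, start = [], 0
    for t in range(1, T + 1):
        if t == T or abs(r[t] - r[start]) > 1e-9:
            blocks.append((start, t - start, sum(y[start:t]) / (t - start)))
            start = t
    return blocks

blocks = compress([0.0] * 8 + [5.0] * 8, 0.1)   # one clean jump at t = 8
```

On this toy signal, the only coefficient surviving the threshold is the one spanning the jump, so exactly two blocks are recovered.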

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ŷ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides samples of them in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

1:  procedure HaMMLET(y, π_A, π_θ, π_q)
2:    T ← |y|
3:    λ ← √(2 ln T)
4:    A ∼ P(A | π_A)
5:    θ ∼ P(θ | π_θ)
6:    for i = 1, ..., N do
7:      σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
8:      Create block sequence B from threshold λ·σ_min
9:      q ∼ P(A, B, θ | q) · P(q | π_q)
10:     A ∼ P(q | A) · P(A | π_A)
11:     θ ∼ P(q, B | θ) · P(θ | π_θ)
12:     Add count of marginal states for q to result
13:   end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large. In this case, blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

\[
\sigma_{MAD}^2 = \left( \frac{\operatorname{MAD}_k\, d_{\log_2 T - 1,\, k}}{\Phi^{-1}(3/4)} \right)^2,
\]

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i ∈ θ}. If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.

A data structure for dynamic compression

The necessity to recreate a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log₂ T and ℓ = h − j. We can map the wavelets ψ_{j,k} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ increases from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position 2^j + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{j,k} = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{j,k}, with ℓ = h − j and t = k·2^ℓ (ψ_{−1,0} is simply constant on the [0,1) interval and has no effect on block creation; thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size:

  \[ N_{\ell,t}[n] = 2^{\ell} \]

• The sum of data points stored in the subtree leaves:

  \[ N_{\ell,t}[\Sigma_1] = \sum_{i \in I_{j,k}} y_i \]

• Similarly, the sum of squares:

  \[ N_{\ell,t}[\Sigma_2] = \sum_{i \in I_{j,k}} y_i^2 \]

• The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself:

  \[ N_{0,t}[d] = 0, \qquad N_{\ell>0,\,t}[d] = \max_{\substack{\ell' \le \ell \\ t \le t' < t + 2^{\ell}}} \left| d_{h-\ell',\, t'/2^{\ell'}} \right| \]

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
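A simplified sketch of this structure in Python (we build the DFS array recursively rather than in place, and the node fields and function names are ours; the orthonormal Haar coefficient of a node over n leaves equals the difference of its children's leaf sums divided by √n, which is what the pyramid algorithm computes level by level):

```python
import math

def build_tree(y, t=0):
    """Return the subtree over y (length a power of 2, starting at data
    position t) as a list of nodes in DFS (preorder) order. Each node
    stores the block size n, start t, sums S1/S2 of its leaves, and the
    maximum absolute wavelet coefficient d in its subtree (0 for leaves)."""
    n = len(y)
    if n == 1:
        return [{"n": 1, "t": t, "S1": y[0], "S2": y[0] * y[0], "d": 0.0}]
    left = build_tree(y[:n // 2], t)
    right = build_tree(y[n // 2:], t + n // 2)
    coef = (left[0]["S1"] - right[0]["S1"]) / math.sqrt(n)  # this node's d_{j,k}
    root = {"n": n, "t": t,
            "S1": left[0]["S1"] + right[0]["S1"],
            "S2": left[0]["S2"] + right[0]["S2"],
            "d": max(abs(coef), left[0]["d"], right[0]["d"])}
    return [root] + left + right

def create_blocks(dfs, threshold):
    """Linear DFS scan: emit one block and jump i -> i + 2n - 1 whenever
    a subtree's maximum coefficient is at or below the threshold."""
    blocks, i = [], 0
    while i < len(dfs):
        node = dfs[i]
        if node["d"] <= threshold:
            blocks.append((node["t"], node["n"], node["S1"], node["S2"]))
            i += 2 * node["n"] - 1    # skip the whole pruned subtree
        else:
            i += 1                    # descend to the left child
    return blocks

dfs = build_tree([0.0] * 4 + [5.0] * 4)
blocks = create_blocks(dfs, 0.1)      # two blocks: [0, 4) and [4, 8)
```

Each emitted tuple already carries the sufficient statistics (n, Σ₁, Σ₂) needed for the blockwise likelihood, so no second pass over the raw data is required.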

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other, occurring in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.
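Block structure prediction can be sketched as a small cache (Python; `make_blocks` stands in for the wavelet-tree traversal, the toy block creator is purely illustrative, and the real implementation registers intervals per block-sequence length rather than only for the most recent one):

```python
class BlockCache:
    """Since decreasing the threshold only subdivides blocks, the number
    of blocks identifies the block sequence. Track the (lo, hi) threshold
    interval known to produce the current sequence and skip recomputation
    for thresholds inside it."""
    def __init__(self, make_blocks):
        self.make = make_blocks
        self.lo = self.hi = None
        self.blocks = None

    def blocks_for(self, lam):
        if self.blocks is not None and self.lo <= lam <= self.hi:
            return self.blocks                # guaranteed identical structure
        b = self.make(lam)
        if self.blocks is not None and len(b) == len(self.blocks):
            self.lo, self.hi = min(self.lo, lam), max(self.hi, lam)
        else:
            self.lo = self.hi = lam           # new structure, new interval
        self.blocks = b
        return b

calls = []
def toy_make(lam):                            # hypothetical block creator
    calls.append(lam)
    return [c for c in (3.0, 1.0, 0.5) if c > lam]

cache = BlockCache(toy_make)
cache.blocks_for(2.0)                         # computed
cache.blocks_for(2.5)                         # computed; same count widens interval
cache.blocks_for(2.2)                         # cache hit, no recomputation
```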

The wavelet tree data structure can be readily extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient in any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT); i.e. the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.



following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ∼ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:

\[
\sigma^2 \sim I\Gamma(\alpha, \beta), \qquad \mu \sim N\!\left( \mu_0, \frac{\sigma^2}{\nu} \right).
\]

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and p the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain

\[
p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left( \alpha, \frac{\beta}{s^2} \right)}{\Gamma(\alpha)} = Q\!\left( \alpha, \frac{\beta}{s^2} \right).
\]

IΓ has a mean for α > 1, and closed-form solutions for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

\[
\beta = -s^2 \left( W_{-1}\!\left( -\frac{p}{e} \right) + 1 \right), \qquad 0 < p \le 1,
\]

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation with a maximum error of 0.025% is given in [87], which yields

\[
\beta \approx s^2 \left( b + \frac{2}{M_1} \left( 1 - \left( 1 + \frac{M_1 \sqrt{b/2}}{1 + M_2\, b\, \exp\!\left( M_3 \sqrt{b} \right)} \right)^{-1} \right) \right), \qquad b = -\ln p,
\]

with M₁ = 0.3361, M₂ = −0.0042, and M₃ = −0.0201.

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that μ contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

\[
\mu_0 = \frac{\Sigma_1}{n} \qquad \text{and} \qquad \nu = \beta \left( \max\left\{ \frac{n \Sigma_2 - \Sigma_1^2}{n^2},\; \delta \right\} \right)^{-1}.
\]
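The computation of β from (s², p) can be sketched as follows (Python, using the W₋₁ approximation above; the function name is ours, and the sanity check uses the closed form Q(2, x) = (1 + x)e^{−x} for α = 2):

```python
import math

def beta_from_p(s2, p):
    """Solve p = P(sigma^2 <= s2) = Q(2, beta / s2) for beta (alpha = 2),
    via the analytical approximation of the Lambert W_{-1} branch with
    constants M1, M2, M3."""
    M1, M2, M3 = 0.3361, -0.0042, -0.0201
    b = -math.log(p)
    inner = 1.0 + M1 * math.sqrt(b / 2.0) / (1.0 + M2 * b * math.exp(M3 * math.sqrt(b)))
    return s2 * (b + (2.0 / M1) * (1.0 - 1.0 / inner))

beta = beta_from_p(1.0, 0.85)
# For alpha = 2 the CDF has the closed form Q(2, x) = (1 + x) * exp(-x),
# so this should recover p ~ 0.85 for s2 = 1:
p_check = (1.0 + beta) * math.exp(-beta)
```

Plugging the approximate β back into the closed-form CDF recovers the requested probability to within the approximation error, which is far below the accuracy that matters for a weak prior.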

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

\[
\frac{A_{jj}^{n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right).
\]

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

\[
\exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1) \log A_{jj} - n_w \log \sigma_j \right).
\]

Using the block's sufficient statistics

\[
n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2,
\]

the exponent can be rewritten as

\[
E_w(j) = \frac{2 \mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j), \qquad
K(n_w, j) = (n_w - 1) \log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right).
\]

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

\[
\exp\Big( \max_k E_w(k) \Big),
\]

which cancels out during normalization. Hence, we obtain

\[
\hat{\alpha}_w(j) = \exp\Big( E_w(j) - \max_k E_w(k) \Big) \sum_{i} \alpha_{w-1}(i)\, A_{ij},
\]

which are then normalized to

\[
\alpha_w(j) = \frac{\hat{\alpha}_w(j)}{\sum_k \hat{\alpha}_w(k)}.
\]

Availability of supporting data

The implementation, as well as the supplemental data, is available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.



Authorsrsquo contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read, and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland WS: Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 74(368), 829–836 (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA Copy Number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A New Smoothing Model for Analyzing Array CGH Data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, 1.1 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), e89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), e41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), e431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), i450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), i139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences, 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L2(R). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5


Speed and convergence effects of wavelet compression

The speedup gained by compression depends on how well the data can be compressed. Poor compression is expected when the means are not well separated, or when short segments have small variance, which necessitates the creation of smaller blocks for the rest of the data to expose potential low-variance segments to the sampler. On the other hand, the data must not be over-compressed, to avoid merging small aberrations with normal segments, which would decrease the F-measure. The level of compression is measured by the compression ratio. Due to the dynamic changes to the block structure, we define the compression ratio as the product of the number of data points T and the number of iterations N, divided by the total number of blocks over all iterations. As usual, a compression ratio of one indicates no compression.
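As a toy illustration of this bookkeeping (all numbers invented), assuming the sampler records how many blocks it used in each iteration:

```python
# Hypothetical bookkeeping: blocks_per_iteration[i] is the number of
# blocks the sampler used in iteration i.
T = 32768                              # number of data points
N = 1000                               # number of sampling iterations
blocks_per_iteration = [512] * N       # invented, perfectly stable blocks

total_blocks = sum(blocks_per_iteration)
compression_ratio = (T * N) / total_blocks
# One block per data point in every iteration would give a ratio of 1.0
# (no compression); here the ratio is 32768 / 512 = 64.0.
```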

To evaluate the impact of dynamic wavelet compression on the speed and convergence properties of an HMM, we created 129600 different data sets with T = 32768 probes each. In each data set, we randomly distributed 1 to 6 gains of a total length of 100, 250, 500, 750 or 1000 uniformly among the data, and did the same for losses. Mean combinations (μ_loss, μ_neutral, μ_gain) were chosen from (log2 1/2, log2 1, log2 3/2), (−1, 0, 1), (−2, 0, 2) and (−10, 0, 10), and variances (σ²_loss, σ²_neutral, σ²_gain) from (0.05, 0.05, 0.05), (0.5, 0.1, 0.9), (0.3, 0.2, 0.1), (0.2, 0.1, 0.3), (0.1, 0.3, 0.2) and (0.1, 0.1, 0.1). These values have been selected to yield a wide range of easy and hard cases: both well-separated, low-variance data with large aberrant segments, as well as cases in which small aberrations overlap significantly with the tail samples of high-variance neutral segments. Consequently, compression ratios range from ∼1 to ∼2100. We use automatic priors P(σ² ≤ 0.2) = 0.9, self-transition priors α_ii ∈ {10, 100, 1000}, non-self transition priors α_ij = 1, and initial state priors α ∈ {1, 10}. Using all possible combinations of the above yields 129600 different simulated data sets, a total of 4.2 billion values.
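A stripped-down version of such a generator (not the authors' exact simulation code; segment placement, overlap handling, and parameter names are simplified guesses) could look like:

```python
import numpy as np

def simulate_probes(T=32768, n_gains=3, gain_len=500, n_losses=3, loss_len=500,
                    means=(-1.0, 0.0, 1.0), sds=(0.3, 0.3, 0.3), seed=1):
    """Scatter aberrant segments uniformly over a neutral background.

    gain_len / loss_len are total lengths split evenly over the segments;
    returns the true per-probe state (0=loss, 1=neutral, 2=gain) and the
    simulated log-ratio-like observations y.
    """
    rng = np.random.default_rng(seed)
    state = np.ones(T, dtype=int)  # start all-neutral
    for label, n_seg, total in ((2, n_gains, gain_len), (0, n_losses, loss_len)):
        seg_len = total // n_seg
        for start in rng.integers(0, T - seg_len, size=n_seg):
            state[start:start + seg_len] = label
    # per-probe Gaussian emissions with state-dependent mean and sd
    y = rng.normal(np.asarray(means)[state], np.asarray(sds)[state])
    return state, y
```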

We achieve speedups per iteration of up to 350 compared to an uncompressed HMM (Fig. 5); in contrast, [62] reported ratios of 10–60, with one instance of 90. Notice that the speedup is not linear in the compression ratio: while the sampling itself is expected to yield a linear speedup, the marginal counts still have to be tallied individually for each position, and the dynamic block creation causes some overhead. The quantization artifacts observed for larger speedups are likely due to the limited resolution of the Linux time command (10 milliseconds). Compressed HaMMLET took about 112 CPU hours for all 129600 simulations, whereas the uncompressed version took over 3 weeks and 5 days. All running times reported are CPU time, measured on a single core of an AMD Opteron 6174 processor clocked at 2.2 GHz.

Figure 5: HaMMLET's speedup as a function of the average compression during sampling. As expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact that some overhead is incurred by the dynamic compression, as well as by parts of the implementation that do not depend on the compression, such as tallying the marginal counts.

We evaluate the convergence of the F-measure of compressed and uncompressed inference for each simulation. Since we are dealing with multi-class classification, we use the micro- and macro-averaged F-measures (F_mi, F_ma) proposed by [71]; they tend to be dominated by the classifier's performance on common and rare categories, respectively. Since all state labels are sampled from the same prior, and hence their relative order is random, we used the label permutation which yielded the highest sum of micro- and macro-averaged F-measures.
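The micro/macro F-measures and the label-permutation matching can be sketched as follows (a straightforward reading of [71], not the authors' evaluation code):

```python
from itertools import permutations

def f_measures(true, pred, classes):
    """Micro- and macro-averaged F-measures for a multi-class labeling."""
    tp = dict.fromkeys(classes, 0)
    fp = dict.fromkeys(classes, 0)
    fn = dict.fromkeys(classes, 0)
    for t, p in zip(true, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # micro: pool the counts over all classes (dominated by common classes)
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    f_mi = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    # macro: average the per-class F-measures (dominated by rare classes)
    per_class = []
    for c in classes:
        d = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / d if d else 0.0)
    f_ma = sum(per_class) / len(classes)
    return f_mi, f_ma

def best_permutation(true, pred, classes):
    """Relabel `pred` by the permutation maximizing F_mi + F_ma, since
    sampled state labels carry no inherent order."""
    best = (0.0, 0.0)
    for perm in permutations(classes):
        relabel = dict(zip(classes, perm))
        scores = f_measures(true, [relabel[p] for p in pred], classes)
        if sum(scores) > sum(best):
            best = scores
    return best
```

Exhaustive permutation search is feasible here because the number of copy-number states is small.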

In Fig. 4, we show that the compressed version of the Gibbs sampler converges almost instantly, whereas the uncompressed version converges much more slowly, with about 5% of the cases failing to yield an F-measure > 0.6 within 1000 iterations. Wavelet compression is likely to yield reasonably large blocks for the majority class early on, which leads to a strong posterior estimate of its parameters and self-transition probabilities. As expected, the values of F_ma are generally worse, since any misclassification in a rare class has a larger impact. Especially in the uncompressed version, we observe that F_ma tends to plateau until F_mi approaches 1.0. Since any misclassification in the majority (neutral) class adds false positives to the minority classes, this effect is expected: correct labeling of the majority class is a necessary condition for correct labeling of the minority classes. In other words, correct identification of the rare, interesting segments requires the sampler to properly converge, which is much harder to achieve without compression. It should be noted that running compressed HaMMLET for 1000 iterations is unnecessary on the simulated data, as in all cases it converges within 25 to 50 iterations. Thus, for all practical purposes, an additional speedup of 40–80 can be achieved by reducing the number of iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG.


Figure 4: The median value (black) and quantile ranges (in 5% steps) of the micro- (top) and macro-averaged (bottom) F-measures for uncompressed (left) and compressed (right) FBG inference on the same 129600 simulated data sets, using automatic priors. The x-axis represents the number of iterations alone, and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges no later than 50 iterations (inset figures, right).


Coriell ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples for the Coriell cell lines. At about 2000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated and typically not used in software evaluations. In Fig. 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample of a 54-year-old female. We used 6-state automatic priors with P(σ² ≤ 0.1) = 0.85 and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1000 iterations; running times for the other samples were similar. Our results for all 25 data sets are included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, the ease of use (using the software, setting software and method parameters), and often the speed of a method. Bayesian methods have acquired a reputation of requiring enormous computational effort and of being difficult to use, given the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of uncertainty in the results. Nevertheless, it is not surprising that a hundred-fold increase in computational effort alone has prevented their widespread use.

Inferring copy number variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g., long segments lost on both chromosomal copies). With lower-quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate, and automated methods is evident.

The core idea of our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method are Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation, and allow summarizing the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM, by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running time, up to 350-fold compared to the standard FB-Gibbs sampler and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus the combination of wavelets and HMM realizes a simultaneous improvement in accuracy and speed; typically, one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of the "easy"-to-call segments, given the current sample of HMM parameters, and identify the "hard"-to-call CN segments which need the full computational effort from FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM, and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements in accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig. 1. Compared to our prior work, we observe a speed-up of up to 3000, due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics, and, lastly, a C++ instead of a Python implementation.

We performed an extensive comparison with the state-of-the-art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and on 129600 simulated data sets, which were produced by a simulation process not based on an HMM, to make it a harder problem for our method. All comparisons demonstrated favorable results for our method when measuring accuracy, at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather with an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance; as the automatic prior is designed to be weak, the prior, and thus the method, is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them for the evaluation, as they might be perceived as unfair to other methods.

In summary our method is as easy to use as other sta-tistically less sophisticated tools more accurate and much

8

CC-BY-ND 40 International licensecertified by peer review) is the authorfunder It is made available under aThe copyright holder for this preprint (which was notthis version posted July 31 2015 httpsdoiorg101101023705doi bioRxiv preprint

Figure 6 HaMMLETrsquos inference of copy-number segments on T47D breast ductal carcinoma Notice that the data is much morecomplex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data

9

CC-BY-ND 40 International licensecertified by peer review) is the authorfunder It is made available under aThe copyright holder for this preprint (which was notthis version posted July 31 2015 httpsdoiorg101101023705doi bioRxiv preprint

more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for principled re-analysis of large cohorts, where the added accuracy and the much-improved information about uncertainty in copy number calls, obtained from posterior marginal distributions, will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y), with transition matrix A, a latent state sequence q = (q_0, q_1, ..., q_{T−1}), an observed emission sequence y = (y_0, y_1, ..., y_{T−1}), and the emission parameters θ. The initial distribution of q_0 is often included and denoted as π, but in a Bayesian setting it makes more sense to treat π as a set of hyperparameters for q_0.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters,

\[ (\mathcal{A}_{\mathrm{ML}}, \theta_{\mathrm{ML}}) = \arg\max_{(\mathcal{A},\theta)} \mathcal{L}(\mathcal{A}, \theta \mid y), \]

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,

\[ q^* = \arg\max_q P(q \mid \mathcal{A}_{\mathrm{ML}}, \theta_{\mathrm{ML}}, y), \]

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, if there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentations [62]. Furthermore, alternative segmentations given those parameters are ignored, as are segmentations under alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly, by integrating out the emission and transition variables:

\[ P(q \mid y) = \int_{\mathcal{A}} \int_{\theta} P(q, \mathcal{A}, \theta \mid y)\, d\theta\, d\mathcal{A}. \]

Since this integral is intractable, it has to be approximated using Markov chain Monte Carlo techniques, i.e. by drawing N samples

\[ \left(q^{(i)}, \mathcal{A}^{(i)}, \theta^{(i)}\right) \sim P(q, \mathcal{A}, \theta \mid y), \]

and subsequently approximating the marginal state probabilities by their frequencies in the sample:

\[ P(q_t = s \mid y) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(q_t^{(i)} = s\right). \]
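As an illustration of this frequency estimate (our own toy sketch, not part of the HaMMLET implementation; all names are hypothetical), the marginal distribution at a position t is obtained by simply counting states across sampled sequences:

```python
from collections import Counter

def marginal_state_probs(samples, t):
    """Estimate P(q_t = s | y) as the relative frequency of each
    state s at position t across the N sampled state sequences."""
    counts = Counter(q[t] for q in samples)
    n = len(samples)
    return {s: c / n for s, c in counts.items()}

# three sampled state sequences of length 4
samples = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 2]]
probs = marginal_state_probs(samples, 1)  # {0: 0.666..., 1: 0.333...}
```

Each position thus carries a full distribution over copy number states, rather than a single point call.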

Thus, for each position t, we obtain a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), each conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

\[ \mathcal{A} \sim P(\mathcal{A} \mid q) \propto P(q \mid \mathcal{A})\, P(\mathcal{A} \mid \pi_{\mathcal{A}}), \]
\[ \theta \sim P(\theta \mid q, y) \propto P(q, y \mid \theta)\, P(\theta \mid \pi_{\theta}), \quad\text{and} \]
\[ q \sim P(q \mid \mathcal{A}, y, \theta) \propto P(\mathcal{A}, y, \theta \mid q)\, P(q \mid \pi_q), \]

where π represents the hyperparameters of the prior distributions. Notice that q depends on a prior only through q_0, so π_q is a set of parameters from which the distribution of q_0 is sampled (typically, π_q are the parameters of a Dirichlet distribution); in non-Bayesian settings, π often denotes P(q_0) itself. Typically, each prior will be conjugate, i.e. it will be in the same class of distributions as the posterior. Thus π_S and π_A(k), the hyperparameters of q_0 and of the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ_0) will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG, which samples over a shorter sequence of sufficient statistics of data segments that are likely to come from the same underlying state. Let B = (B_w)_{w=1}^{W} be a partition of y into W blocks, where each block B_w contains n_w elements, and let y_{w,k} denote the k-th element of B_w.



The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transition into state j, and the n_w − 1 self-transitions, which yields

\[ \alpha_w(j) = A_{jj}^{\,n_w - 1}\, P(B_w \mid \mu_j, \sigma_j^2) \sum_{i} \alpha_{w-1}(i)\, A_{ij}, \quad\text{and} \]
\[ P(B_w \mid \mu, \sigma^2) = \prod_{k=1}^{n_w} P(y_{w,k} \mid \mu, \sigma^2). \]

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but would still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can easily be obtained.

Wavelet theory preliminaries

Here we review some essential wavelet theory; for details, see [74, 75]. Let

\[ \psi(x) = \begin{cases} 1 & 0 \le x < \tfrac{1}{2}, \\ -1 & \tfrac{1}{2} \le x < 1, \\ 0 & \text{elsewhere} \end{cases} \]

be the Haar wavelet [76], and let ψ_{jk}(x) := 2^{j/2} ψ(2^j x − k); j and k are called the scale and shift parameters. Any square-integrable function over the unit interval, f ∈ L²([0,1)), can be approximated using the orthonormal basis {ψ_{jk} | j, k ∈ ℤ; −1 ≤ j; 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x), which is spatially localized, i.e. non-zero only on a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data y = (y_0, ..., y_{T−1}) can be treated as an equidistant sample of f(x) by scaling the indices to the unit interval using x_t = t/T. Let h := log₂ T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j ≤ h − 1):
\[ y_t = \sum_{j,k} d_{jk}\, \psi_{jk}(x_t). \]

The wavelet transform d = W y is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [77].
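As an illustrative sketch (our own toy code, not the HaMMLET implementation), the pyramid algorithm repeatedly splits the signal into pairwise scaled sums (smooths) and scaled differences (detail coefficients), touching each element a constant number of times:

```python
import math

def haar_pyramid(y):
    """O(T) Haar transform of a signal of length T = 2^h.
    Returns (root coefficient, list of detail levels, finest first)."""
    a, details = list(y), []
    while len(a) > 1:
        s = [(a[2*i] + a[2*i+1]) / math.sqrt(2) for i in range(len(a) // 2)]
        d = [(a[2*i] - a[2*i+1]) / math.sqrt(2) for i in range(len(a) // 2)]
        details.append(d)  # finest level first
        a = s              # recurse on the smoothed half-length signal
    return a[0], details

root, details = haar_pyramid([0.0, 0.0, 2.0, 2.0])
# for this piecewise constant signal, the only non-zero detail
# coefficient sits at the coarsest level, marking the breakpoint
```

Because the transform is orthonormal, the energy of the signal is preserved in the coefficients, which is what makes thresholding arguments about the quadratic risk possible.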

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̃ be a vector obtained by setting some elements of d to zero; then ỹ = Wᵀd̃ is called a selective wavelet reconstruction (SW). If all coefficients d_{jk} with ψ_{jk}(x_t) ≠ ψ_{jk}(x_{t+1}) are set to zero for some t, then ỹ_t = ỹ_{t+1}, which implies a block structure on ỹ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created this way. This general equivalence between SW and compression is central to our method. Note that ỹ does not have to be computed explicitly: the block boundaries can be inferred from the positions of the zero entries in d̃ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem: finding an estimate μ̂ of a piecewise constant function μ whose range is precisely the set of emission means, i.e.

\[ \mu = f(x), \qquad y = f(x) + \varepsilon, \qquad \varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2). \]

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ŷ ≠ μ. However, if ŷ is piecewise constant and minimizes ‖μ − ŷ‖, the sample means of its blocks are close to the true emission means. This yields high likelihoods for their corresponding states, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints; ŷ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle ∆(μ, y) that knew the true μ, then there would exist a method M_SW(y, ∆), using an optimal subset of wavelet coefficients provided by ∆, such that the quadratic risk of ŷ_SW = Wᵀd_SW is bounded as

\[ \left\| \mu - \hat{y}_{\mathrm{SW}} \right\|_2^2 = O\!\left( \sigma^2\, \frac{\ln T}{T} \right). \]

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M_WCT(y, λ, σ), called wavelet coefficient thresholding, which sets to zero all coefficients whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T) with ŷ_WCT = Wᵀd_WCT such that
\[ \left\| \mu - \hat{y}_{\mathrm{WCT}} \right\|_2^2 \le (2 \ln T + 1) \left( \left\| \mu - \hat{y}_{\mathrm{SW}} \right\|_2^2 + \frac{\sigma^2}{T} \right). \]

This λ_T is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T (on the order of tens to hundreds), the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough



to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².
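To make the procedure concrete, the following self-contained sketch (our own toy code with hypothetical names, not the HaMMLET implementation) hard-thresholds Haar coefficients at λ = σ√(2 ln T) and reconstructs a piecewise constant estimate:

```python
import math

def haar(y):
    """Forward Haar pyramid: (root, list of detail levels, finest first)."""
    a, details = list(y), []
    while len(a) > 1:
        details.append([(a[2*i] - a[2*i+1]) / math.sqrt(2) for i in range(len(a) // 2)])
        a = [(a[2*i] + a[2*i+1]) / math.sqrt(2) for i in range(len(a) // 2)]
    return a[0], details

def inverse_haar(root, details):
    """Invert the pyramid, coarsest level first on the way down."""
    a = [root]
    for d in reversed(details):
        a = [v for s, dd in zip(a, d)
             for v in ((s + dd) / math.sqrt(2), (s - dd) / math.sqrt(2))]
    return a

def universal_denoise(y, sigma):
    """Zero all coefficients below the universal threshold and reconstruct."""
    lam = sigma * math.sqrt(2 * math.log(len(y)))
    root, details = haar(y)
    details = [[0.0 if abs(c) <= lam else c for c in d] for d in details]
    return inverse_haar(root, details)

y = [0.01, -0.02, 0.03, 0.00, 1.01, 0.99, 1.02, 0.98]
yhat = universal_denoise(y, sigma=0.05)
# yhat is piecewise constant: two blocks of four, whose values
# equal the block means (about 0.005 and 1.0)
```

In HaMMLET the reconstruction itself is never computed; only the positions of the zeroed coefficients are needed to delimit the blocks.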

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ỹ will approximately preserve the breakpoints around those low-variance segments. Segments of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides samples of them in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

 1: procedure HaMMLET(y, π_A, π_θ, π_q)
 2:     T ← |y|
 3:     λ ← √(2 ln T)
 4:     A ∼ P(A | π_A)
 5:     θ ∼ P(θ | π_θ)
 6:     for i = 1, ..., N do
 7:         σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:         Create block sequence B from threshold λσ_min
 9:         q ∼ P(q | A, B, θ) ∝ P(A, B, θ | q) P(q | π_q)
10:         A ∼ P(A | q) ∝ P(q | A) P(A | π_A)
11:         θ ∼ P(θ | q, B) ∝ P(q, B | θ) P(θ | π_θ)
12:         Add counts of marginal states of q to result
13:     end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: blocks would then be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against such over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

\[ \sigma^2_{\mathrm{MAD}} = \left( \frac{\mathrm{MAD}_k\, d_{\log_2 T - 1,\, k}}{\Phi^{-1}\!\left(\tfrac{3}{4}\right)} \right)^{2}, \]

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i ∈ θ}. If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
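This estimator can be sketched as follows (our own illustration; function names are hypothetical). The finest-level Haar coefficients of Gaussian noise are again N(0, σ²), so the median absolute deviation of those coefficients, rescaled by Φ⁻¹(3/4) ≈ 0.6745, is a robust estimate of σ:

```python
import math
import random
import statistics

def sigma_mad(y):
    """Robust noise level from the finest-level Haar detail
    coefficients d_{h-1,k}; the MAD is rescaled by Phi^{-1}(3/4)
    to be consistent for Gaussian noise."""
    d = [(y[2*i] - y[2*i+1]) / math.sqrt(2) for i in range(len(y) // 2)]
    return statistics.median(abs(c) for c in d) / 0.67449

random.seed(42)
noisy = [random.gauss(0.0, 0.5) for _ in range(1024)]
est = sigma_mad(noisy)  # close to the true noise level 0.5
```

Because each finest-level coefficient is a difference of two neighboring observations, piecewise constant signal components cancel everywhere except at the breakpoints, which is why the median is largely unaffected by them.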

A data structure for dynamic compression

The necessity to create a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead; specifically, without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again let h := log₂ T, and ℓ := h − j. We can map the wavelet ψ_{jk} to a perfect binary tree of height h, such that all wavelets of scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ increases from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{jk} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t with t in the interval I_{jk} = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in its entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to the wavelet ψ_{jk}, with ℓ = h − j and t = k·2^ℓ (ψ_{−1,0} is simply constant on the interval [0,1) and has no effect on block creation, so we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block obtained when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size:
\[ N_{\ell,t}[n] = 2^{\ell} \]

• The sum of the data points stored in the subtree's leaves:
\[ N_{\ell,t}[\Sigma_1] = \sum_{i \in I_{jk}} y_i \]

• Similarly, the sum of squares:
\[ N_{\ell,t}[\Sigma_2] = \sum_{i \in I_{jk}} y_i^2 \]

• The maximum absolute wavelet coefficient of the subtree, including the current d_{jk} itself:
\[ N_{0,t}[d] = 0, \qquad N_{\ell>0,\,t}[d] = \max_{\substack{\ell' \le \ell \\ t \le t' < t + 2^{\ell}}} \left| d_{h-\ell',\, t'/2^{\ell'}} \right| \]

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients



on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for an outlier is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created and the entire subtree is skipped (dashed lines). As the tree is perfect, binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
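The following self-contained sketch illustrates the idea (a simplified variant of our own; the actual implementation works in place on the transformed array, and all names here are hypothetical). It builds a flat DFS array of nodes carrying (n, max |d|, Σ₁, Σ₂) and creates blocks using the i + 2n − 1 jump:

```python
import math

def build_dfs(y):
    """Build the wavelet tree as a flat array in DFS (preorder) order.
    Each entry: (n, max_abs_d, sum, sumsq) for a subtree covering
    n consecutive data points; len(y) must be a power of two."""
    nodes = []

    def rec(lo, n):
        idx = len(nodes)
        nodes.append(None)               # placeholder, filled below
        if n == 1:
            nodes[idx] = (1, 0.0, y[lo], y[lo] ** 2)
            return 0.0, y[lo], y[lo] ** 2
        half = n // 2
        ml, sl, ql = rec(lo, half)       # left child subtree
        mr, sr, qr = rec(lo + half, half)
        d = (sl - sr) / math.sqrt(n)     # Haar coefficient of this node
        m = max(abs(d), ml, mr)          # subtree maximum
        nodes[idx] = (n, m, sl + sr, ql + qr)
        return m, sl + sr, ql + qr

    rec(0, len(y))
    return nodes

def blocks(nodes, threshold):
    """Linear DFS traversal with pruning: emit a block and jump
    i + 2n - 1 positions whenever the subtree maximum is below
    the threshold, otherwise descend one position."""
    out, i = [], 0
    while i < len(nodes):
        n, m, s, q = nodes[i]
        if m <= threshold:
            out.append((n, s, q))   # block: size, sum, sum of squares
            i += 2 * n - 1          # skip the entire subtree
        else:
            i += 1
    return out

nodes = build_dfs([0.0] * 4 + [1.0] * 4)
# two blocks of size 4 with sufficient statistics (4, 0, 0) and (4, 4, 4)
print(blocks(nodes, 0.5))
```

Lowering the threshold only subdivides existing blocks, which is what makes the block structure prediction described below possible.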

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other, and occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check whether the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.

The wavelet tree data structure can readily be extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into a subtree if the coefficient of any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not into the queries during the sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

Figure 7: Mapping of wavelets ψ_{jk} and data points y_t to tree nodes N_{ℓ,t}, plotted over the starting position of a block (t) and the block size (n), or equivalently the level (ℓ). Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus, the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and therefore strictly smaller than 2T.

Figure 8: Example of dynamic block creation, showing block sizes and measurements along the chromosome position. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus, the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.



following method to specify the hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ∼ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution,

\[ \sigma^2 \sim \mathrm{I}\Gamma(\alpha, \beta), \qquad \mu \sim N\!\left(\mu_0, \frac{\sigma^2}{\nu}\right). \]

Let s² be a user-defined variance (or one automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and p the desired probability of sampling a variance not larger than s². From the CDF of IΓ, we obtain

\[ p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right). \]

IΓ has a mean for α > 1, and closed-form solutions exist for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

\[ \beta = -s^2 \left( W_{-1}\!\left(-\frac{p}{e}\right) + 1 \right), \qquad 0 < p \le 1, \]

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation, with a maximum error of 0.025%, is given in [87], which yields

\[ \beta \approx s^2 \left( \frac{\sqrt{2b}}{1 + M_1 \left( \sqrt{b/2} + M_2\, b\, e^{M_3 \sqrt{b}} \right)} + b \right), \qquad b = -\ln p, \]
\[ M_1 = 0.3361, \qquad M_2 = -0.0042, \qquad M_3 = -0.0201. \]
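As a sanity check, the approximation can be evaluated against the target quantile condition: for α = 2, the CDF of IΓ(2, β) at s² is Q(2, β/s²) = (1 + x)e^{−x} with x = β/s², which should come back close to p. The code below is our own illustration of the formula, with hypothetical names:

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201

def beta_hyperparam(s2, p):
    """beta for IG(alpha=2, beta) such that P(sigma^2 <= s2) ~= p,
    via the analytic Lambert-W approximation (sketch)."""
    b = -math.log(p)
    denom = 1.0 + M1 * (math.sqrt(b / 2) + M2 * b * math.exp(M3 * math.sqrt(b)))
    return s2 * (math.sqrt(2 * b) / denom + b)

# check: the implied quantile (1 + x) * exp(-x), x = beta / s2,
# should reproduce the requested probability p
for p in (0.9, 0.5, 0.1):
    x = beta_hyperparam(1.0, p)
    assert abs((1 + x) * math.exp(-x) - p) < 0.005
```

Note that the closed-form check avoids any dependence on a numerical Lambert-W implementation.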

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that μ contained all states in almost equal numbers. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of the block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

\[ \mu_0 = \frac{\Sigma_1}{n}, \qquad \nu = \beta\, \max\!\left( \frac{n\Sigma_2 - \Sigma_1^2}{n^2},\ \delta \right)^{-1}. \]

Numerical issues

To ensure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional

speedup of about one order of magnitude compared to the log-space version (data not shown). The term accounting for the emissions and self-transitions within a block can be written as

\[ \frac{A_{jj}^{\,n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right). \]

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

\[ \exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1) \log A_{jj} - n_w \log \sigma_j \right). \]

Using the block's sufficient statistics,

\[ n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2, \]

the exponent can be rewritten as

\[ E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j), \]
\[ K(n_w, j) = (n_w - 1) \log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right). \]

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

\[ \exp\!\left( \max_k E_w(k) \right), \]

which cancels out during normalization. Hence we obtain

\[ \alpha_w(j) = \exp\!\left( E_w(j) - \max_k E_w(k) \right) \sum_{i} \hat{\alpha}_{w-1}(i)\, A_{ij}, \]

which is then normalized to

\[ \hat{\alpha}_w(j) = \frac{\alpha_w(j)}{\sum_k \alpha_w(k)}. \]

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.



Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W., Lee, C.: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk, L., Marshall, C.R., Wintle, R.F., Scherer, S.W.: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll, S.A., Altshuler, D.M.: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain, L.V., Armour, J.A.L., Tobin, M.D.: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann, J.S., Sharp, A.J., Antonarakis, S.E.: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook, E.H., Scherer, S.W.: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic, A., Tomljanovic, D.: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas, A.K., Corvin, A.P., Gallagher, L.: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho, E.K., Tchinda, J., Freeman, J.L., Chung, Y.-J., Cai, W.W., Lee, C.: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien, A., Malkin, D.: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk, L., Carson, A.R., Scherer, S.W.: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann, J.S., Estivill, X., Antonarakis, S.E.: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp, A.J.: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis, C., Coe, B.P., Zhang, L., Rosin, M.P., Lam, W.L.: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson, D.G., Ylstra, B., Segraves, R., Collins, C., Dairkee, S.H., Kowbel, D., Kuo, W.L., Gray, J.W., Pinkel, D.: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman, J.A., Fridlyand, J., Pejavar, S., Olshen, A.B., Korkola, J.E., DeVries, S., Carroll, P., Kuo, W.-L., Pinkel, D., Albertson, D.G., Cordon-Cardo, C., Jain, A.N., Waldman, F.M.: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio, R., Hautaniemi, S., Kauraniemi, P., Yli-Harja, O., Astola, J., Wolf, M., Kallioniemi, A.: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack, J.R., Sørlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lonning, P.E., Tibshirani, R., Botstein, D., Børresen-Dale, A.-L., Brown, P.O.: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers, P.H.C., de Menezes, R.X.: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148



[20] Cleveland, W.S.: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti, B., Braude, I., Marrano, P., Thorner, P., Zielenska, M., Squire, J.A.: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl, J., Spokoiny, V.G.: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé, P., Stransky, N., Thiery, J.-P., Radvanyi, F., Barillot, E.: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang, Y., Wang, S.: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu, L., Self, S.G., Grove, D., Randolph, T., Wang, K., Delrow, J.J., Loo, L., Porter, P.: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang, H., Nguyen, N., Oraintara, S., Vo, A.: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt, C., Losic, B., Pai, D., Zhao, Z., Trinh, Q., Syam, S., Arshadi, N., Jang, G.H., Ali, J., Beck, T., McPherson, J., Muthuswamy, L.B.: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price, T.S., Regan, R., Mott, R., Hedman, A., Honey, B., Daniels, R.J., Smith, L., Greenfield, A., Tiganescu, A., Buckle, V., Ventress, N., Ayyub, H., Salhan, A., Pedraza-Diaz, S., Broxholme, J., Ragoussis, J., Higgs, D.R., Flint, J., Knight, S.J.L.: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis, C.E., Peng, R., Tsiarli, M.A., Miller, G.L., Schwartz, R.: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16 (2011). doi:10.1145/1963190.2063517

[32] Olshen, A.B., Venkatraman, E.S.: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman, E.S., Olshen, A.B.: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard, F., Robin, S., Lavielle, M., Vaisse, C., Daudin, J.-J.: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers, C.L., Dunham, M.J., Kung, S.Y., Troyanskaya, O.G.: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang, P., Kim, Y., Pollack, J., Narasimhan, B., Tibshirani, R.: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing, B., Greenwood, C.M.T., Bull, S.B.: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen, C.-H., Lee, H.-C.C., Ling, Q., Chen, H.-R., Ko, Y.-A., Tsou, T.-S., Wang, S.-C., Wu, L.-C., Lee, H.-C.C.: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137

bioRxiv preprint, doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–4379 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–528 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–6079 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–1146 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), e41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), e431–e439 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), i450–i458 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), i139–i145 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–4091 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–464 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–264 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357. SIAM (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–291 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5



Figure 4: The median value (black) and quantile ranges (in 5% steps) of the micro- (top) and macro-averaged (bottom) F-measures for uncompressed (left) and compressed (right) FBG inference on the same 129,600 simulated data sets, using automatic priors. The x-axis represents the number of iterations alone and does not reflect the additional speedup obtained through compression. Notice that the compressed HMM converges no later than 50 iterations (inset figures, right).


Coriell, ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples for the Coriell cell line. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated and typically not used in software evaluations. In Fig. 6, we demonstrate our ability to infer the correct segments on the most complex example: a T47D breast ductal carcinoma sample of a 54-year-old female. We used 6-state automatic priors with $P(\sigma^2 \le 0.1) = 0.85$ and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1,000 iterations; running times for the other samples were similar. Our results for all 25 data sets have been included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use (using the software, setting software and method parameters), and often the speed of a method. Bayesian methods have obtained a reputation of requiring enormous computational effort and being difficult to use, due to the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of uncertainty in results. Nevertheless, it is not surprising that a hundred-fold increase in computational effort alone prevented wide-spread use.

Inferring Copy Number Variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g., long segments lost on both chromosomal copies). With lesser quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate and automated methods is evident.

The core idea for our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method are Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation and allow us to summarize the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running times, up to 350-fold compared to the standard FB-Gibbs sampler and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus the combination of wavelets and HMM realizes a simultaneous improvement on accuracy and on speed; typically, one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of those "easy" to call segments given the current sample of HMM parameters, and identify "hard" to call CN segments which need the full computational effort from FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements to accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig. 1. Compared to our prior work, we observe a speed-up of up to 3,000 due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics, and lastly a C++ instead of a Python implementation.

We performed an extensive comparison with the state-of-the-art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and 129,600 simulated data sets, which were produced by a simulation process not based on HMM, to make it a harder problem for our method. All comparisons demonstrated favorable results for our method when measuring accuracy, at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls and without cutting algorithmic corners, but rather with an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior and thus the method is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them for the evaluation, as they might be perceived as unfair to other methods.

In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate, and much more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for principled re-analysis of large cohorts, where the added accuracy and the much improved information about uncertainty in copy number calls from posterior marginal distributions will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Figure 6: HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let $T$ be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model $(q, \mathcal{A}, \theta \mid y)$ with transition matrix $\mathcal{A}$, a latent state sequence $q = (q_0, q_1, \ldots, q_{T-1})$, an observed emission sequence $y = (y_0, y_1, \ldots, y_{T-1})$, and the emission parameters $\theta$. The initial distribution of $q_0$ is often included and denoted as $\pi$, but in a Bayesian setting it makes more sense to take $\pi$ as a set of hyperparameters for $q_0$.

In the usual frequentist approach, the state sequence $q$ is inferred by first finding a maximum likelihood estimate of the parameters,
$$(\mathcal{A}_{\mathrm{ML}}, \theta_{\mathrm{ML}}) = \operatorname*{arg\,max}_{(\mathcal{A}, \theta)} \mathcal{L}(\mathcal{A}, \theta \mid y),$$
using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,
$$q^* = \operatorname*{arg\,max}_{q} P(q \mid \mathcal{A}_{\mathrm{ML}}, \theta_{\mathrm{ML}}, y),$$
is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentation [62]. Furthermore, alternative segmentations given those parameters are ignored, as are the ones for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly, by integrating out the emission and transition variables:
$$P(q \mid y) = \int_{\mathcal{A}} \int_{\theta} P(q, \mathcal{A}, \theta \mid y) \, d\theta \, d\mathcal{A}.$$
Since this integral is intractable, it has to be approximated using Markov Chain Monte Carlo techniques, i.e. drawing $N$ samples
$$(q^{(i)}, \mathcal{A}^{(i)}, \theta^{(i)}) \sim P(q, \mathcal{A}, \theta \mid y),$$
and subsequently approximating marginal state probabilities by their frequency in the sample:
$$P(q_t = s \mid y) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\bigl(q_t^{(i)} = s\bigr).$$
Thus, for each position $t$, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals $(q \mid \mathcal{A}, \theta, y)$, $(\mathcal{A} \mid q, \theta, y)$ and $(\theta \mid q, \mathcal{A}, y)$, conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as
$$\mathcal{A} \sim P(\mathcal{A} \mid q) \propto P(q \mid \mathcal{A}) \, P(\mathcal{A} \mid \pi_{\mathcal{A}}),$$
$$\theta \sim P(\theta \mid q, y) \propto P(q, y \mid \theta) \, P(\theta \mid \pi_{\theta}), \quad \text{and}$$
$$q \sim P(q \mid \mathcal{A}, y, \theta) \propto P(\mathcal{A}, y, \theta \mid q) \, P(q \mid \pi_q),$$
where $\pi$ represents hyperparameters to the prior distributions. Notice that $q$ depends on a prior only in $q_0$, so $\pi_q$ is a set of parameters from which the distribution of $q_0$ is sampled (typically, $\pi_q$ are parameters of a Dirichlet distribution). In non-Bayesian settings, $\pi$ often denotes $P(q_0)$ itself. Typically, each prior will be conjugate, i.e. it will be the same class of distributions as the posterior. Thus $\pi_S$ and $\pi_{\mathcal{A}^{(k)}}$, the hyperparameters of $q_0$ and the $k$-th row of $\mathcal{A}$, will be the $\alpha_i$ of a Dirichlet distribution, and $\pi_{\theta} = (\alpha, \beta, \nu, \mu_0)$ will be the parameters of a Normal-Inverse Gamma distribution.
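As one concrete instance of these conjugate updates, the transition matrix draw $\mathcal{A} \sim P(\mathcal{A} \mid q)$ reduces to sampling each row from a Dirichlet posterior whose parameters are the prior pseudo-counts plus the transition counts observed in $q$. The following is a minimal sketch; the function and variable names are our own illustrations, not taken from the HaMMLET implementation:

```python
import numpy as np

def sample_transition_matrix(q, n_states, alpha, rng):
    """One Gibbs update A ~ P(A | q): with a Dirichlet(alpha) prior on
    each row, the posterior of row s is Dirichlet(alpha + c_s), where
    c_s counts the transitions s -> j observed in the state sequence q."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(q[:-1], q[1:]):
        counts[a, b] += 1
    # sample each row independently from its Dirichlet posterior
    return np.vstack([rng.dirichlet(alpha + counts[s]) for s in range(n_states)])
```

With many observed transitions, the posterior rows concentrate around the empirical transition frequencies, while the prior pseudo-counts dominate for rarely visited states.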

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let $B = (B_w)_{w=1}^{W}$ be a partition of $y$ into $W$ blocks, where each block $B_w$ contains $n_w$ elements, and let $y_{w,k}$ be the $k$-th element in $B_w$.


The forward variable $\alpha_w(j)$ for this block needs to take into account the $n_w$ emissions, the transitions into state $j$, and the $n_w - 1$ self-transitions, which yields
$$\alpha_w(j) = A_{jj}^{n_w - 1} \, P(B_w \mid \mu_j, \sigma_j^2) \sum_{i} \alpha_{w-1}(i) \, A_{ij}, \quad \text{and}$$
$$P(B_w \mid \mu, \sigma^2) = \prod_{k=1}^{n_w} P(y_{w,k} \mid \mu, \sigma^2).$$
The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can be easily obtained.
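A direct way to read these two equations is as a forward pass over per-block sufficient statistics $(n_w, \sum_k y_{w,k}, \sum_k y_{w,k}^2)$ instead of single observations. The sketch below assumes Gaussian emissions and works in log-space for numerical stability; all names are illustrative, not taken from the HaMMLET implementation:

```python
import numpy as np

def logsumexp(a):
    # numerically stable log(sum(exp(a)))
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def block_loglik(n, s, ss, mu, var):
    # Gaussian log-likelihood of a whole block from its sufficient
    # statistics: n observations with sum s and sum of squares ss
    return -0.5 * n * np.log(2 * np.pi * var) - (ss - 2 * mu * s + n * mu ** 2) / (2 * var)

def forward_blocks(blocks, A, mu, var, pi0):
    """Forward variables over a block partition: each block B_w costs one
    transition into state j, (n_w - 1) self-transitions, and the joint
    emission likelihood of its n_w observations."""
    logA = np.log(A)
    alphas = []
    for w, (n, s, ss) in enumerate(blocks):
        la = np.empty(len(mu))
        for j in range(len(mu)):
            incoming = np.log(pi0[j]) if w == 0 else logsumexp(alphas[-1] + logA[:, j])
            la[j] = (n - 1) * logA[j, j] + block_loglik(n, s, ss, mu[j], var[j]) + incoming
        alphas.append(la)
    return alphas
```

The cost per sweep is thus proportional to the number of blocks $W$ rather than the number of probes $T$.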

Wavelet theory preliminaries

Here we review some essential wavelet theory; for details, see [74, 75]. Let
$$\psi(x) = \begin{cases} 1 & 0 \le x < \frac{1}{2}, \\ -1 & \frac{1}{2} \le x < 1, \\ 0 & \text{elsewhere} \end{cases}$$
be the Haar wavelet [76], and $\psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k)$; $j$ and $k$ are called the scale and shift parameter. Any square-integrable function over the unit interval, $f \in L^2([0,1))$, can be approximated using the orthonormal basis $\{\psi_{j,k} \mid j, k \in \mathbb{Z},\, -1 \le j,\, 0 \le k \le 2^j - 1\}$, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function $f(x)$ using scaled and shifted copies of one simple basis function $\psi(x)$ which is spatially localized, i.e. non-zero on only a finite interval in $x$. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data $y = (y_0, \ldots, y_{T-1})$ can be treated as an equidistant sample of $f(x)$ by scaling the indices to the unit interval using $x_t = \frac{t}{T}$. Let $h = \log_2 T$. Then $y$ can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution ($j \le h - 1$):
$$y_t = \sum_{j,k} d_{j,k} \, \psi_{j,k}(x_t).$$
The wavelet transform $d = \mathcal{W} y$ is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Surprisingly, $d$ can be computed in linear time using the pyramid algorithm [77].
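The pyramid algorithm repeatedly splits the current smooth signal into pairwise differences (the detail coefficients) and pairwise sums (the next, coarser smooth), both scaled by $1/\sqrt{2}$, so all coefficients are produced in $O(T)$ total work. A minimal sketch for $T = 2^h$ data points (illustrative code, not the paper's implementation):

```python
import numpy as np

def haar_pyramid(y):
    """O(T) Haar transform for T = 2^h points. Returns the single
    scaling coefficient and the detail coefficients per level,
    finest level first."""
    s = np.asarray(y, dtype=float)
    details = []
    while s.size > 1:
        details.append((s[0::2] - s[1::2]) / np.sqrt(2))
        s = (s[0::2] + s[1::2]) / np.sqrt(2)
    return s[0], details

def haar_inverse(s0, details):
    """Invert the pyramid; the transform is orthogonal, so the data
    is recovered exactly."""
    s = np.array([s0])
    for d in reversed(details):
        out = np.empty(2 * s.size)
        out[0::2] = (s + d) / np.sqrt(2)
        out[1::2] = (s - d) / np.sqrt(2)
        s = out
    return s
```

Orthogonality means the transform preserves the signal's energy (sum of squares), which the inverse pass exploits to reconstruct the data without loss.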

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let $\tilde{d}$ be a vector obtained by setting elements of $d$ to zero; then $\tilde{y} = \mathcal{W}^{\intercal} \tilde{d}$ is called selective wavelet reconstruction (SW). If all coefficients $d_{j,k}$ with $\psi_{j,k}(x_t) \ne \psi_{j,k}(x_{t+1})$ are set to zero for some $t$, then $\tilde{y}_t = \tilde{y}_{t+1}$, which implies a block structure on $\tilde{y}$. Conversely, blocks of size $> 2$ (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that $\tilde{y}$ does not have to be computed explicitly; the block boundaries can be inferred from the position of zero-entries in $\tilde{d}$ alone.

Assume all HMM states had the same emission variance $\sigma^2$. Since each state is associated with an emission mean, finding $q$ can be viewed as a regression or smoothing problem of finding an estimate $\hat{\mu}$ of a piecewise constant function $\mu$, whose range is precisely the set of emission means, i.e.
$$\mu = f(x), \qquad y = f(x) + \varepsilon, \qquad \varepsilon \overset{\mathrm{iid}}{\sim} N(0, \sigma^2).$$
Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate $\hat{y} \ne \mu$. However, if $\hat{y}$ is piecewise constant and minimizes $\lVert \mu - \hat{y} \rVert$, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding state, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in $\mu$ must be close to change points in $\hat{y}$, since moving block boundaries incurs additional loss, allowing for approximate recovery of true breakpoints. $\hat{y}$ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle $\Delta(\mu, y)$ that would know the true $\mu$, then there exists a method $M_{\mathrm{SW}}(y, \Delta)$, using an optimal subset of wavelet coefficients provided by $\Delta$, such that the quadratic risk of $\tilde{y}_{\mathrm{SW}} = \mathcal{W}^{\intercal} \tilde{d}_{\mathrm{SW}}$ is bounded as
$$\lVert \mu - \tilde{y}_{\mathrm{SW}} \rVert_2^2 = O\!\left( \sigma^2 \, \frac{\ln T}{T} \right).$$
By definition, $M_{\mathrm{SW}}$ would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method $M_{\mathrm{WCT}}(y, \lambda\sigma)$, called wavelet coefficient thresholding, which sets those coefficients to zero whose absolute value is smaller than some threshold $\lambda\sigma$, there exists some $\lambda_T \le \sqrt{2 \ln T}$ with $\tilde{y}_{\mathrm{WCT}} = \mathcal{W}^{\intercal} \tilde{d}_{\mathrm{WCT}}$ such that
$$\lVert \mu - \tilde{y}_{\mathrm{WCT}} \rVert_2^2 \le (2 \ln T + 1) \left( \lVert \mu - \tilde{y}_{\mathrm{SW}} \rVert_2^2 + \frac{\sigma^2}{T} \right).$$
This $\lambda_T$ is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large $T$, on the order of tens to hundreds, the universal threshold $\lambda_T^u = \sqrt{2 \ln T}$ is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate $\mu$, and thus the best wavelet-based compression scheme for a given noise level $\sigma^2$.
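Putting the pieces together, universal thresholding amounts to zeroing every detail coefficient with absolute value below $\sigma\sqrt{2 \ln T}$ and reading the block boundaries off runs of equal values in the reconstruction. A self-contained sketch, assuming the noise level $\sigma$ is known and restating the pyramid transform so the example runs on its own (all names are illustrative):

```python
import numpy as np

def haar_pyramid(y):
    # forward Haar transform (restated so this sketch is self-contained)
    s, details = np.asarray(y, dtype=float), []
    while s.size > 1:
        details.append((s[0::2] - s[1::2]) / np.sqrt(2))
        s = (s[0::2] + s[1::2]) / np.sqrt(2)
    return s[0], details

def haar_inverse(s0, details):
    # inverse Haar transform
    s = np.array([s0])
    for d in reversed(details):
        out = np.empty(2 * s.size)
        out[0::2], out[1::2] = (s + d) / np.sqrt(2), (s - d) / np.sqrt(2)
        s = out
    return s

def universal_threshold_blocks(y, sigma):
    """Zero every detail coefficient with |d| < sigma * sqrt(2 ln T),
    reconstruct, and return the reconstruction together with the block
    partition given by its runs of equal values."""
    T = len(y)
    lam = sigma * np.sqrt(2 * np.log(T))
    s0, details = haar_pyramid(y)
    details = [np.where(np.abs(d) < lam, 0.0, d) for d in details]
    y_hat = haar_inverse(s0, details)
    bounds = [0] + [t for t in range(1, T) if y_hat[t] != y_hat[t - 1]] + [T]
    return y_hat, list(zip(bounds[:-1], bounds[1:]))
```

A small threshold (low $\sigma$) keeps breakpoints and yields many blocks; a large one collapses the data into few blocks, illustrating how the compression level tracks the assumed noise level.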

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, $\tilde{y}$ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides a priori samples in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

 1: procedure HaMMLET(y, π_A, π_θ, π_q)
 2:     T ← |y|
 3:     λ ← √(2 ln T)
 4:     A ∼ P(A | π_A)
 5:     θ ∼ P(θ | π_θ)
 6:     for i = 1, …, N do
 7:         σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:         Create block sequence B from threshold λσ_min
 9:         q ∼ P(q | A, B, θ) ∝ P(A, B, θ | q) P(q | π_q)
10:         A ∼ P(A | q) ∝ P(q | A) P(A | π_A)
11:         θ ∼ P(θ | q, B) ∝ P(q, B | θ) P(θ | π_θ)
12:         Add count of marginal states for q to result
13:     end for
14: end procedure

While intuitively easy to understand provable guaranteesfor the optimality of this method specifically the correspon-dence between the wavelet and the HMM domain remainan open research topic A potential problem could ariseif all sampled variances are too large In this case blockswould be under-segmented yield wrong posterior variancesand hide possible state transitions As a safeguard againstover-compression we use the standard method to estimatethe variance of constant noise in shrinkage applications

$$\sigma^2_{\mathrm{MAD}} = \left( \frac{\operatorname{MAD}_k\, d_{\log_2(T)-1,\,k}}{\Phi^{-1}\!\left(\tfrac{3}{4}\right)} \right)^{\!2},$$

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i | σ_i² ∈ θ}. If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
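A sketch of this estimator (the function name is illustrative), assuming the finest-level Haar detail coefficients d_k = (y_{2k} − y_{2k+1})/√2 and the usual Gaussian-consistency constant Φ⁻¹(3/4) ≈ 0.6745:

```python
import math
import statistics

PHI_INV_3_4 = 0.6744897501960817  # Phi^{-1}(3/4), the 75% quantile of N(0, 1)

def sigma2_mad(y):
    """Estimate the noise variance from the finest-level Haar detail
    coefficients d_k = (y_{2k} - y_{2k+1}) / sqrt(2), via the median
    absolute deviation (MAD) rescaled for Gaussian consistency."""
    d = [(y[2*k] - y[2*k+1]) / math.sqrt(2.0) for k in range(len(y) // 2)]
    med = statistics.median(d)
    mad = statistics.median([abs(x - med) for x in d])
    return (mad / PHI_INV_3_4) ** 2
```

Because the finest details are differences of neighboring observations, true segment means largely cancel, which is why this statistic tracks the noise level of the dominant component rather than the signal.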

A data structure for dynamic compression

The necessity to create a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, raises the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log₂ T and ℓ = h − j. We can map the wavelet ψ_{j,k} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ increases from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{j,k} = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{j,k}, with ℓ = h − j and t = k·2^ℓ (ψ_{−1,0} is simply constant on the [0, 1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block obtained when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size: N_{ℓ,t}[n] = 2^ℓ.

• The sum of the data points stored in the subtree leaves: N_{ℓ,t}[Σ₁] = Σ_{i ∈ I_{j,k}} y_i.

• Similarly, the sum of squares: N_{ℓ,t}[Σ₂] = Σ_{i ∈ I_{j,k}} y_i².

• The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself: N_{0,t}[d] = 0, and N_{ℓ>0,t}[d] = max_{ℓ′ ≤ ℓ, t ≤ t′ < t+2^ℓ} |d_{h−ℓ′, t′ 2^{−ℓ′}}|.

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect, binary, and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
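The flat-array tree and the pruning traversal can be sketched as follows (Python; the dict-based node layout and the function names are illustrative conveniences, not HaMMLET's actual implementation):

```python
import math

def build_dfs_tree(y):
    """Build the wavelet tree as a flat array in DFS order. Each node holds
    the block size n, start position t, the sufficient statistics s1/s2 over
    its leaves, and the maximum absolute Haar coefficient d in its subtree
    (0 for leaves, matching N_{0,t}[d] = 0)."""
    def rec(lo, hi, out):
        n = hi - lo
        if n == 1:
            out.append({"n": 1, "t": lo, "s1": y[lo], "s2": y[lo] ** 2, "d": 0.0})
            return out[-1]
        node = {"n": n, "t": lo}
        out.append(node)                       # parent precedes its subtrees in DFS
        left = rec(lo, (lo + hi) // 2, out)
        right = rec((lo + hi) // 2, hi, out)
        coef = (left["s1"] - right["s1"]) / math.sqrt(n)  # Haar detail coefficient
        node["s1"] = left["s1"] + right["s1"]
        node["s2"] = left["s2"] + right["s2"]
        node["d"] = max(abs(coef), left["d"], right["d"])
        return node
    out = []
    rec(0, len(y), out)
    return out

def create_blocks(tree, threshold):
    """Linear DFS scan: prune a subtree (emit one block) as soon as its
    maximum coefficient is <= threshold, jumping ahead by 2n - 1 positions."""
    blocks, i = [], 0
    while i < len(tree):
        node = tree[i]
        if node["d"] <= threshold:
            blocks.append((node["t"], node["n"], node["s1"], node["s2"]))
            i += 2 * node["n"] - 1             # skip the whole pruned subtree
        else:
            i += 1                             # descend: next DFS entry is the left child
    return blocks
```

A pruned subtree with n leaves occupies 2n − 1 consecutive DFS positions, so the traversal advances from its root at index i directly to i + 2n − 1, exactly the jump arithmetic described above.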

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check whether the current variance would create the same number of blocks as in the previous iteration; this guarantees that we would obtain the same block sequence, and hence we can avoid recomputation.
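A hedged sketch of this bookkeeping (`predicted_blocks`, `make_blocks`, and `cache` are hypothetical names chosen for illustration):

```python
def predicted_blocks(sigma2, make_blocks, cache):
    """Block structure prediction. Decreasing the variance only ever
    subdivides blocks, so the block count identifies the block sequence;
    `cache` maps a count to (min_var, max_var, blocks) for the variances
    known to yield it, and `make_blocks(sigma2)` falls back to the tree
    traversal when no recorded interval contains sigma2."""
    for lo, hi, blocks in cache.values():
        if lo <= sigma2 <= hi:
            return blocks                      # cache hit: same count, same sequence
    blocks = make_blocks(sigma2)
    count = len(blocks)
    if count in cache:
        lo, hi, _ = cache[count]
        cache[count] = (min(lo, sigma2), max(hi, sigma2), blocks)
    else:
        cache[count] = (sigma2, sigma2, blocks)
    return blocks
```

Repeated iterations whose sampled variance falls inside a recorded interval return the cached sequence without touching the tree at all.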

The wavelet tree data structure can be readily extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

[Figure 7; axes: starting position of block (t), block size (n), level (ℓ); legend: DFS array (i + 1), wavelet tree, pruning jumps (i + 2n − 1).]

Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus, the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

[Figure 8; axes: position along chromosome, measurement, block size (1–256).]

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; the compression ratio is thus 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


following method to specify the hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ∼ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:

$$\sigma^2 \sim \mathrm{I\Gamma}(\alpha, \beta), \qquad \mu \sim \mathrm{N}\!\left(\mu_0, \frac{\sigma^2}{\nu}\right).$$

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and let p be the desired probability of sampling a variance not larger than s². From the CDF of IΓ, we obtain

$$p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right).$$

IΓ has a mean for α > 1, and closed-form solutions exist for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

$$\beta = -s^2 \left( W_{-1}\!\left(-\frac{p}{e}\right) + 1 \right), \qquad 0 < p \le 1,$$

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation with a maximum error of 0.025% is given in [87], which yields

$$\beta \approx s^2 \left( b + \frac{\sqrt{2b}}{1 + M_1 \sqrt{b/2} + M_2\, b\, e^{M_3 \sqrt{b}}} \right), \qquad b = -\ln p,$$

$$M_1 = 0.3361, \qquad M_2 = -0.0042, \qquad M_3 = -0.0201.$$
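Since Q(2, x) = (1 + x)e^{−x} (a standard identity for the regularized upper incomplete gamma function at α = 2), β can also be obtained numerically instead of via W₋₁. The sketch below (illustrative, not HaMMLET's implementation) solves the CDF relation by bisection:

```python
import math

def beta_for(p, s2, tol=1e-12):
    """Solve P(sigma^2 <= s2) = p for beta with alpha = 2, using
    Q(2, x) = (1 + x) * exp(-x) and bisection on x = beta / s2."""
    q = lambda x: (1.0 + x) * math.exp(-x)  # decreasing in x, with q(0) = 1
    lo, hi = 0.0, 1.0
    while q(hi) > p:                        # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if q(mid) > p else (lo, mid)
    return s2 * 0.5 * (lo + hi)
```

For example, p = 0.85 and s² = 0.1, the setting used for the Coriell data below, yields a β for which the IΓ(2, β) prior places 85% of its mass on variances up to 0.1.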

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that the data contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of the block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

$$\mu_0 = \frac{\Sigma_1}{n} \qquad \text{and} \qquad \nu = \beta \left[ \max\!\left( \frac{n\Sigma_2 - \Sigma_1^2}{n^2},\; \delta \right) \right]^{-1}.$$
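A literal transcription of these two formulas (the function name is hypothetical; sum1, sum2, and n are the root node's sufficient statistics Σ₁, Σ₂, and the data size):

```python
def mu_hyperparams(sum1, sum2, n, beta, delta):
    """Compute mu_0 and nu from the root node's sufficient statistics, so
    that the expected variance of mu, beta / nu, matches the larger of the
    data's sample variance and the block-mean variance delta."""
    mu0 = sum1 / n
    sample_var = (n * sum2 - sum1 * sum1) / (n * n)
    nu = beta / max(sample_var, delta)
    return mu0, nu
```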

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within a block can be written as

$$\frac{A_{jj}^{\,n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{\,n_w}} \exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right).$$

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much stabler logarithm function:

$$\exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1)\log A_{jj} - n_w \log \sigma_j \right).$$

Using the block's sufficient statistics

$$n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2,$$

the exponent can be rewritten as

$$E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j),$$

$$K(n_w, j) = (n_w - 1)\log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right).$$

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, so that the shifted exponents satisfy E_w(j) − max_k E_w(k) ≤ 0. This is equivalent to dividing the forward variables by

$$\exp\!\left( \max_k E_w(k) \right),$$

which cancels out during normalization. Hence, we obtain

$$\hat{\alpha}_w(j) = \exp\!\left( E_w(j) - \max_k E_w(k) \right) \sum_{i} \alpha_{w-1}(i)\, A_{ij},$$

where the sum runs over all states i,

which are then normalized to

$$\alpha_w(j) = \frac{\hat{\alpha}_w(j)}{\sum_k \hat{\alpha}_w(k)}.$$
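Putting the pieces together, one forward step over a single block might look like the following sketch (illustrative names, not HaMMLET's implementation; A is the transition matrix, prev_alpha the previous block's normalized forward variables, and sum1/sum2/nw the block's sufficient statistics):

```python
import math

def block_forward_step(prev_alpha, A, mus, sigmas, sum1, sum2, nw):
    """One normalized forward step over a compressed block, using the
    exponent E_w(j) = (2 mu_j Sigma_1 - Sigma_2) / (2 sigma_j^2) + K(nw, j)
    with K(nw, j) = (nw - 1) log A_jj - nw (log sigma_j + mu_j^2 / (2 sigma_j^2)),
    subtracting max_k E_w(k) before exponentiation for stability."""
    S = len(mus)
    E = []
    for j in range(S):
        K = (nw - 1) * math.log(A[j][j]) - nw * (
            math.log(sigmas[j]) + mus[j] ** 2 / (2.0 * sigmas[j] ** 2))
        E.append((2.0 * mus[j] * sum1 - sum2) / (2.0 * sigmas[j] ** 2) + K)
    m = max(E)                                 # shift so all exponents are <= 0
    raw = [math.exp(E[j] - m) * sum(prev_alpha[i] * A[i][j] for i in range(S))
           for j in range(S)]
    z = sum(raw)
    return [r / z for r in raw]
```

In a full sampler, K would be precomputed once per iteration for each state and each occurring block size, which is where the claimed savings in exp/log/pow calls come from.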

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read, and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland WS: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9 Suppl 2, S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal breakpoint detection in human cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models (1998)

[54] Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian hidden Markov modeling of array CGH data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for hidden Markov models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text categorization with class-based and corpus-based keyword selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011)

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009)

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(R). Transactions of the American Mathematical Society 315(1) (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of subtidal coastal sea level fluctuations using wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical properties and uses of the wavelet variance estimator for the scale analysis of time series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5



Coriell, ATCC and breast carcinoma

The data provided by [72] includes 15 aCGH samples for the Coriell cell lines. At about 2,000 probes, the data is small compared to modern high-density arrays. Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 4 breast carcinomas, all of which are substantially more complicated and typically not used in software evaluations. In Fig. 6, we demonstrate our ability to infer the correct segments on the most complex example, a T47D breast ductal carcinoma sample of a 54-year-old female. We used 6-state automatic priors with P(σ² ≤ 0.1) = 0.85 and all Dirichlet hyperparameters set to 1. On a standard laptop, HaMMLET took 0.09 seconds for 1,000 iterations; running times for the other samples were similar. Our results for all 25 data sets have been included in the supplement.

Conclusions

In the analysis of biological data, there are usually conflicting objectives at play which need to be balanced: the required accuracy of the analysis, ease of use (using the software, setting software and method parameters), and often the speed of a method. Bayesian methods have obtained a reputation of requiring enormous computational effort and being difficult to use, for the expert knowledge required for choosing prior distributions. It has also been recognized [60, 62, 73] that they are very powerful and accurate, leading to improved, high-quality results and providing, in the form of posterior distributions, an accurate measure of uncertainty in results. Nevertheless, it is not surprising that a hundred-fold increase in computational effort alone has prevented wide-spread use.

Inferring copy number variants (CNV) is a quite special problem, as experts can identify CN changes visually, at least on very good data and for large, drastic CN changes (e.g. long segments lost on both chromosomal copies). With lesser quality data, smaller CN differences, and in the analysis of cohorts, the need for objective, highly accurate and automated methods is evident.

The core idea for our method expands on our prior work [62] and affirms a conjecture by Lai et al. [65] that a combination of smoothing and segmentation will likely improve results. One ingredient of our method are Haar wavelets, which have previously been used for pre-processing and visualization [17, 64]. In a sense, they quantify and identify regions of high variation, and allow to summarize the data at various levels of resolution, somewhat similar to how an expert would perform a visual analysis. We combine, for the first time, wavelets with a full Bayesian HMM, by dynamically and adaptively inferring blocks of subsequent observations from our wavelet data structure. The HMM operates on blocks instead of individual observations, which leads to great savings in running times, up to 350-fold compared to the standard FB-Gibbs sampler and up to 288 times faster than CBS. Much more importantly, operating on the blocks greatly improves convergence of the sampler, leading to higher accuracy for a much smaller number of sampling iterations. Thus, the combination of wavelets and HMM realizes a simultaneous improvement on accuracy and on speed; typically, one can have one or the other. An intuitive explanation as to why this works is that the blocks derived from the wavelet structure allow efficient summary treatment of those "easy" to call segments, given the current sample of HMM parameters, and identify "hard" to call CN segments which need the full computational effort from FB-Gibbs. Note that it is absolutely crucial that the block structure depends on the parameters sampled for the HMM, and will change drastically over the run time. This is in contrast to our prior work [62], which used static blocks and saw no improvements to accuracy and convergence speed. The data structures and linear-time algorithms we introduce here provide the efficient means for recomputing these blocks at every cycle of the sampling, cf. Fig. 1. Compared to our prior work, we observe a speed-up of up to 3,000-fold, due to the greatly improved convergence, O(T) vs. O(T log T) clustering, improved numerics, and, lastly, a C++ instead of a Python implementation.

We performed an extensive comparison with the state-of-the-art, as identified by several review and benchmark publications, and with the standard FB-Gibbs sampler, on a wide range of biological data sets and 129,600 simulated data sets, which were produced by a simulation process not based on HMMs, to make it a harder problem for our method. All comparisons demonstrated favorable results for our method when measuring accuracy, at a very noticeable acceleration compared to the state-of-the-art. It must be stressed that these results were obtained with a statistically sophisticated model for CNV calls, and without cutting algorithmic corners, but rather through an effective allocation of computational effort.

All our computations are performed using our automatic prior, which derives estimates for the hyperparameters of the priors for means and variances directly from the wavelet tree structure and the resulting blocks. The block structure also imposes a prior on self-transition probabilities. The user only has to provide an estimate of the noise variance, but as the automatic prior is designed to be weak, the prior, and thus the method, is robust against incorrect estimates. We have demonstrated this by using different hyperparameters for the associated Dirichlet priors in our simulations, which HaMMLET is able to infer correctly regardless of the transition priors. At the same time, the automatic prior can be used to tune certain aspects of the HMM if stronger prior knowledge is available. We would expect further improvements from non-automatic, expert-selected priors, but refrained from using them for the evaluation, as they might be perceived as unfair to other methods.

In summary, our method is as easy to use as other, statistically less sophisticated tools, more accurate, and much more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and for principled re-analysis of large cohorts, where the added accuracy and the much improved information about uncertainty in copy number calls from posterior marginal distributions will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Figure 6: HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y), with transition matrix A, a latent state sequence q = (q_0, q_1, …, q_{T−1}), an observed emission sequence y = (y_0, y_1, …, y_{T−1}), and the emission parameters θ. The initial distribution of q_0 is often included and denoted as π, but in a Bayesian setting it makes more sense to take π as a set of hyperparameters for q_0.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters,
$$(\mathcal{A}_{\text{ML}}, \theta_{\text{ML}}) = \underset{(\mathcal{A},\, \theta)}{\arg\max}\ \mathcal{L}(\mathcal{A}, \theta \mid y),$$

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then, the most likely state sequence given those parameters,

$$q^* = \underset{q}{\arg\max}\ P(q \mid \mathcal{A}_{\text{ML}}, \theta_{\text{ML}}, y),$$

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentation [62]. Furthermore, alternative segmentations given those parameters are also ignored, as are the ones for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly by integrating out the emission and transition variables:
$$P(q \mid y) = \int_{\mathcal{A}} \int_{\theta} P(q, \mathcal{A}, \theta \mid y)\, d\theta\, d\mathcal{A}$$

Since this integral is intractable, it has to be approximated using Markov Chain Monte Carlo techniques, i.e., drawing N samples
$$(q^{(i)}, \mathcal{A}^{(i)}, \theta^{(i)}) \sim P(q, \mathcal{A}, \theta \mid y),$$
and subsequently approximating marginal state probabilities by their frequency in the sample:
$$P(q_t = s \mid y) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(q_t^{(i)} = s\right)$$

Thus, for each position t, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e., repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

$$\mathcal{A} \sim P(\mathcal{A} \mid q) \propto P(q \mid \mathcal{A})\, P(\mathcal{A} \mid \pi_{\mathcal{A}}),$$
$$\theta \sim P(\theta \mid q, y) \propto P(q, y \mid \theta)\, P(\theta \mid \pi_{\theta}), \text{ and}$$
$$q \sim P(q \mid \mathcal{A}, y, \theta) \propto P(\mathcal{A}, y, \theta \mid q)\, P(q \mid \pi_q),$$
where π represents hyperparameters to the prior distributions. Notice that q depends on a prior only in q_0, so π_q is a set of parameters from which the distribution of q_0 is sampled (typically, π_q are parameters of a Dirichlet distribution). In non-Bayesian settings, π often denotes P(q_0) itself. Typically, each prior will be conjugate, i.e., it will be the same class of distributions as the posterior. Thus, π_q and π_{A(k)}, the hyperparameters of q_0 and the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ_0) will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG, by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B = (B_w), w = 1, …, W, be a partition of y into W blocks. Each block B_w contains n_w elements; let y_{w,k} be the k-th element in B_w.


The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transitions into state j, and the n_w − 1 self-transitions, which yields
$$\alpha_w(j) = \mathcal{A}_{jj}^{n_w - 1}\, P(B_w \mid \mu_j, \sigma_j^2) \sum_{i} \alpha_{w-1}(i)\, \mathcal{A}_{ij}, \quad \text{and}$$
$$P(B_w \mid \mu, \sigma^2) = \prod_{k=1}^{n_w} P(y_{w,k} \mid \mu, \sigma^2).$$

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can easily be obtained.
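To make the block-wise recursion above concrete, the following minimal Python sketch runs the forward pass over a given block partition with Gaussian emissions. The two-state parameters and block partition are made-up illustration values, the per-step normalization anticipates the stabilization discussed under "Numerical issues", and this is not HaMMLET's actual C++ code:

```python
import math

def block_forward(blocks, A, mu, sigma):
    """Forward recursion over blocks: each block B_w contributes n_w
    emissions, one transition into state j, and n_w - 1 self-transitions."""
    S = len(mu)

    def emission(block, j):
        # P(B_w | mu_j, sigma_j^2) as a product of Gaussian densities
        p = 1.0
        for y in block:
            z = (y - mu[j]) / sigma[j]
            p *= math.exp(-0.5 * z * z) / (sigma[j] * math.sqrt(2 * math.pi))
        return p

    # first block: uniform initial state distribution
    alpha = [A[j][j] ** (len(blocks[0]) - 1) * emission(blocks[0], j) / S
             for j in range(S)]
    norm = sum(alpha)
    alpha = [a / norm for a in alpha]
    for w in range(1, len(blocks)):
        n = len(blocks[w])
        prev = alpha
        alpha = [A[j][j] ** (n - 1) * emission(blocks[w], j)
                 * sum(prev[i] * A[i][j] for i in range(S))
                 for j in range(S)]
        norm = sum(alpha)
        alpha = [a / norm for a in alpha]  # normalize to avoid underflow
    return alpha
```

For example, with mu = [0.0, 1.0], sigma = [0.1, 0.1] and a final block of values near 1, the final forward distribution concentrates on the second state.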

Wavelet theory preliminaries

Here, we review some essential wavelet theory; for details, see [74, 75]. Let
$$\psi(x) = \begin{cases} 1 & 0 \le x < \frac{1}{2} \\ -1 & \frac{1}{2} \le x < 1 \\ 0 & \text{elsewhere} \end{cases}$$
be the Haar wavelet [76], and $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$; j and k are called the scale and shift parameter. Any square-integrable function over the unit interval, f ∈ L²([0, 1)), can be approximated using the orthonormal basis {ψ_{jk} | j, k ∈ ℤ, −1 ≤ j, 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x), which is spatially localized, i.e., non-zero on only a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data y = (y_0, …, y_{T−1}) can be treated as an equidistant sample of f(x) by scaling the indices to the unit interval, using x_t = t/T. Let h = log₂ T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j ≤ h − 1):
$$y_t = \sum_{j,k} d_{jk}\, \psi_{jk}(x_t)$$
The wavelet transform d = Wy is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [77].
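For illustration, a generic linear-time implementation of the orthonormal Haar pyramid transform and its inverse. This is a textbook sketch assuming T is a power of 2, not the data structure used by HaMMLET:

```python
import math

def haar_transform(y):
    """Orthonormal Haar DWT via the pyramid algorithm: each pass turns
    pairs into scaled averages (passed up) and differences (details).
    Output: approximation coefficient first, then details from coarsest
    to finest scale."""
    s = list(y)
    details = []
    while len(s) > 1:
        d = [(s[2 * i] - s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        s = [(s[2 * i] + s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        details.insert(0, d)  # coarser levels end up in front
    return s + [x for level in details for x in level]

def haar_inverse(c):
    """Invert haar_transform by unrolling the pyramid from the top."""
    s, pos = [c[0]], 1
    while pos < len(c):
        level = c[pos:pos + len(s)]
        nxt = []
        for a, d in zip(s, level):
            nxt.append((a + d) / math.sqrt(2))
            nxt.append((a - d) / math.sqrt(2))
        s, pos = nxt, pos + len(level)
    return s
```

Since the transform is orthonormal, it preserves the sum of squares (Parseval) and round-trips exactly up to floating-point error.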

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̃ be a vector obtained by setting elements of d to zero; then ỹ = Wᵀd̃ ≠ Wᵀd is called selective wavelet reconstruction (SW). If all coefficients d_{jk} with ψ_{jk}(x_t) ≠ ψ_{jk}(x_{t+1}) are set to zero for some t, then ỹ_t = ỹ_{t+1}, which implies a block structure on ỹ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that ỹ does not have to be computed explicitly; the block boundaries can be inferred from the positions of zero-entries in d̃ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem of finding an estimate μ̂ of a piecewise constant function μ, whose range is precisely the set of emission means, i.e.,
$$\mu = f(x), \qquad y = f(x) + \epsilon, \qquad \epsilon \overset{\text{iid}}{\sim} N(0, \sigma^2).$$
Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ŷ ≠ μ. However, if ŷ is piecewise constant and minimizes ‖μ − ŷ‖, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding state, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ŷ, since moving block boundaries incurs additional loss, allowing for approximate recovery of true breakpoints. ŷ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could, in theory, be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle Δ(μ, y) that would know the true μ, then there exists a method M_SW(y, Δ), using an optimal subset of wavelet coefficients provided by Δ, such that the quadratic risk of ŷ_SW = Wᵀ_SW d is bounded as
$$\|\mu - \hat{y}_{\text{SW}}\|_2^2 = O\!\left(\frac{\sigma^2 \ln T}{T}\right).$$

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M_WCT(y, λσ), called wavelet coefficient thresholding, which sets those coefficients to zero whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T), with ŷ_WCT = Wᵀ_WCT d, such that
$$\|\mu - \hat{y}_{\text{WCT}}\|_2^2 \le (2 \ln T + 1)\left(\|\mu - \hat{y}_{\text{SW}}\|_2^2 + \frac{\sigma^2}{T}\right).$$

This λ_T is minimax, i.e., the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².
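The thresholding step can be sketched as follows (a self-contained illustration with a compact copy of the pyramid transform; in the actual method, the reconstruction is never computed explicitly, only the block boundaries):

```python
import math

def haar(y):
    # orthonormal Haar pyramid transform, approximation coefficient first
    s, details = list(y), []
    while len(s) > 1:
        d = [(s[2 * i] - s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        s = [(s[2 * i] + s[2 * i + 1]) / math.sqrt(2) for i in range(len(s) // 2)]
        details.insert(0, d)
    return s + [x for level in details for x in level]

def ihaar(c):
    s, pos = [c[0]], 1
    while pos < len(c):
        level, nxt = c[pos:pos + len(s)], []
        for a, d in zip(s, level):
            nxt += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        s, pos = nxt, pos + len(level)
    return s

def universal_threshold_reconstruction(y, sigma):
    """Zero all detail coefficients with |d| <= lambda_u * sigma, where
    lambda_u = sqrt(2 ln T), then reconstruct; runs of equal values in
    the result correspond to blocks."""
    lam = math.sqrt(2 * math.log(len(y))) * sigma
    c = haar(y)
    kept = c[:1] + [d if abs(d) > lam else 0.0 for d in c[1:]]
    return ihaar(kept)
```

On a clean step signal, only the coefficients straddling the breakpoint survive the threshold, and the reconstruction is piecewise constant.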

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ỹ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides a priori samples in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

1:  procedure HaMMLET(y, π_A, π_θ, π_q)
2:      T ← |y|
3:      λ ← √(2 ln T)
4:      A ∼ P(A | π_A)
5:      θ ∼ P(θ | π_θ)
6:      for i = 1, …, N do
7:          σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
8:          Create block sequence B from threshold λσ_min
9:          q ∼ P(A, B, θ | q) P(q | π_q)
10:         A ∼ P(q | A) P(A | π_A)
11:         θ ∼ P(q, B | θ) P(θ | π_θ)
12:         Add count of marginal states for q to result
13:     end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: in this case, blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

$$\sigma^2_{\text{MAD}} = \left(\frac{\operatorname{MAD}_k\left(d_{\log_2 T - 1,\, k}\right)}{\Phi^{-1}\!\left(\frac{3}{4}\right)}\right)^2,$$
as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min({σ_MAD} ∪ {σ_i | σ_i² ∈ θ}). If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
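A sketch of this estimator in Python (the helper names are ours; Φ⁻¹(3/4) ≈ 0.6745, and the finest-level detail coefficients are computed directly as scaled pairwise differences):

```python
import math
import random

def _median(v):
    s = sorted(v)
    m = len(s) // 2
    return s[m] if len(s) % 2 else 0.5 * (s[m - 1] + s[m])

def sigma_mad_sq(y):
    """Estimate the variance of the dominant noise component from the
    finest-level Haar detail coefficients, via the median absolute
    deviation (MAD) rescaled to be consistent for Gaussian noise."""
    d = [(y[2 * i] - y[2 * i + 1]) / math.sqrt(2)
         for i in range(len(y) // 2)]
    med = _median(d)
    mad = _median([abs(x - med) for x in d])
    return (mad / 0.6745) ** 2
```

Because only pairs straddling a breakpoint are affected by mean shifts, the estimate is robust against the segment structure itself.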

A data structure for dynamic compression

The necessity to recreate a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log₂ T and ℓ = h − j. We can map the wavelet ψ_{jk} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ is increasing from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{jk} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{jk} = [kn, (k + 1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{jk}, with ℓ = h − j and t = k2^ℓ (ψ_{−1,0} is simply constant on the [0, 1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

- The number of leaves, corresponding to the block size:
$$N_{\ell,t}[n] = 2^{\ell}$$
- The sum of data points stored in the subtree leaves:
$$N_{\ell,t}[\Sigma_1] = \sum_{i \in I_{jk}} y_i$$
- Similarly, the sum of squares:
$$N_{\ell,t}[\Sigma_2] = \sum_{i \in I_{jk}} y_i^2$$
- The maximum absolute wavelet coefficient of the subtree, including the current d_{jk} itself:
$$N_{0,t}[d] = 0, \qquad N_{\ell>0,t}[d] = \max_{\substack{\ell' \le \ell \\ t \le t' < t + 2^{\ell}}} \left|d_{h-\ell',\, t'/2^{\ell'}}\right|$$

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients


on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
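The pruning logic can be illustrated with a recursive sketch. For clarity, we recompute subtree summaries on the fly, giving O(T log T) work; the paper's flat DFS array with precomputed node statistics and i + 2n − 1 jumps performs the same traversal without recursion in O(T):

```python
import math

def create_blocks(y, threshold):
    """Return (start, size, mean) triples: a subtree whose maximum
    absolute wavelet coefficient is <= threshold is pruned into one
    block, without ever computing the inverse transform."""
    out = []

    def summarize(lo, hi):
        # (sum of leaves, max |coefficient|) for the subtree over y[lo:hi]
        if hi - lo == 1:
            return y[lo], 0.0
        mid = (lo + hi) // 2
        sl, ml = summarize(lo, mid)
        sr, mr = summarize(mid, hi)
        coef = abs(sl - sr) / math.sqrt(hi - lo)  # Haar coefficient
        return sl + sr, max(coef, ml, mr)

    def walk(lo, hi):
        total, max_coef = summarize(lo, hi)
        if max_coef <= threshold or hi - lo == 1:
            out.append((lo, hi - lo, total / (hi - lo)))  # prune subtree
        else:
            mid = (lo + hi) // 2
            walk(lo, mid)
            walk(mid, hi)

    walk(0, len(y))
    return out
```

On a clean step signal, only the root coefficient exceeds a moderate threshold, so each constant half collapses into a single block whose mean comes straight from the stored sufficient statistics.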

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.

The wavelet tree data structure can be readily extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT), i.e., the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists, or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

Figure 7: Mapping of wavelets ψ_{jk} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means, derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels, and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ∼ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:
$$\sigma^2 \sim \mathrm{I}\Gamma(\alpha, \beta), \qquad \mu \sim N\!\left(\mu_0, \frac{\sigma^2}{\nu}\right)$$

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and p the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain
$$p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right).$$

IΓ has a mean for α > 1, and closed-form solutions for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields
$$\beta = -s^2\left(W_{-1}\!\left(-\frac{p}{e}\right) + 1\right), \qquad 0 < p \le 1,$$

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation, with a maximum error of 0.025%, is given in [87], which yields
$$\beta \approx s^2\left(\frac{2\sqrt{b}}{M_1\sqrt{b} + \sqrt{2}\left(M_2\, b\, e^{M_3\sqrt{b}} + 1\right)} + b\right), \qquad b = -\ln p,$$
with M₁ = 0.3361, M₂ = −0.0042, M₃ = −0.0201.
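In code, the closed form is a direct transcription (the function name is ours; its correctness can be checked by plugging the result back into the CDF identity p = (1 + β/s²)·e^(−β/s²) for α = 2):

```python
import math

def beta_for_quantile(p, s2):
    """Hyperparameter beta of the IGamma(alpha=2, beta) prior such that
    P(sigma^2 <= s2) is approximately p, using the Barry et al. [87]
    approximation of the Lambert W_{-1} branch."""
    M1, M2, M3 = 0.3361, -0.0042, -0.0201
    b = -math.log(p)
    w = 2 * math.sqrt(b) / (M1 * math.sqrt(b)
        + math.sqrt(2) * (M2 * b * math.exp(M3 * math.sqrt(b)) + 1.0))
    return s2 * (w + b)
```

For the T47D run above (p = 0.85, s² = 0.1), the resulting β satisfies the quantile condition to well within the stated approximation error.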

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that μ contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain
$$\mu_0 = \frac{\Sigma_1}{n} \qquad \text{and} \qquad \nu = \beta\, \max\!\left(\frac{n\Sigma_2 - \Sigma_1^2}{n^2},\ \delta\right)^{-1}.$$

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

$$\frac{\mathcal{A}_{jj}^{n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\!\left(-\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2}\right).$$

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much stabler logarithm function:

$$\exp\!\left(-\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1)\log \mathcal{A}_{jj} - n_w \log \sigma_j\right)$$

Using the block's sufficient statistics,
$$n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2,$$

the exponent can be rewritten as
$$E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j),$$
$$K(n_w, j) = (n_w - 1)\log \mathcal{A}_{jj} - n_w\left(\log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2}\right).$$

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by
$$\exp\!\left(\max_k E_w(k)\right),$$

which cancels out during normalization. Hence, we obtain
$$\alpha_w(j) = \exp\!\left(E_w(j) - \max_k E_w(k)\right) \sum_{i} \alpha_{w-1}(i)\, \mathcal{A}_{ij},$$

which are then normalized to
$$\hat{\alpha}_w(j) = \frac{\alpha_w(j)}{\sum_k \alpha_w(k)}.$$
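A sketch of this rewritten computation, checked against the naive product of Gaussian densities (illustration values only; the constant (2π)^(n_w/2) is dropped, as it cancels during normalization, and K is computed inline rather than precomputed):

```python
import math

def stable_block_step(prev_alpha, A, mu, sigma, n, S1, S2):
    """One forward step over a block given its sufficient statistics
    (n, S1 = sum of y, S2 = sum of y^2), using the exponent E_w(j)
    with the precomputable term K and max-subtraction against overflow."""
    S = len(mu)
    E = []
    for j in range(S):
        K = (n - 1) * math.log(A[j][j]) - n * (
            math.log(sigma[j]) + mu[j] ** 2 / (2 * sigma[j] ** 2))
        E.append((2 * mu[j] * S1 - S2) / (2 * sigma[j] ** 2) + K)
    m = max(E)  # E[j] - m <= 0, so exp() cannot overflow
    alpha = [math.exp(E[j] - m)
             * sum(prev_alpha[i] * A[i][j] for i in range(S))
             for j in range(S)]
    norm = sum(alpha)
    return [a / norm for a in alpha]
```

After normalization, the result agrees with the direct (underflow-prone) computation, since both the dropped constant and the subtracted maximum are shared by all states.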

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authorsrsquo contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland WS: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, 1–1 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal breakpoint detection in human cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text categorization with class-based and corpus-based keyword selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69–69 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of subtidal coastal sea level fluctuations using wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical properties and uses of the wavelet variance estimator for the scale analysis of time series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5



Figure 6: HaMMLET's inference of copy-number segments on T47D breast ductal carcinoma. Notice that the data is much more complex than the simple structure of a diploid majority class with some small aberrations typically observed for Coriell data.



more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and principled re-analysis of large cohorts, where the added accuracy and the much improved information about uncertainty in copy number calls from posterior marginal distributions will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let $T$ be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model $(q, A, \theta \mid y)$ with transition matrix $A$, a latent state sequence $q = (q_0, q_1, \dots, q_{T-1})$, an observed emission sequence $y = (y_0, y_1, \dots, y_{T-1})$, and the emission parameters $\theta$. The initial distribution of $q_0$ is often included and denoted as $\pi$, but in a Bayesian setting it makes more sense to take $\pi$ as a set of hyperparameters for $q_0$.

In the usual frequentist approach, the state sequence $q$ is inferred by first finding a maximum likelihood estimate of the parameters,
\[
(A_{\mathrm{ML}}, \theta_{\mathrm{ML}}) = \arg\max_{(A, \theta)} \mathcal{L}(A, \theta \mid y),
\]
using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,
\[
q^* = \arg\max_q P(q \mid A_{\mathrm{ML}}, \theta_{\mathrm{ML}}, y),
\]
is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, if there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentation [62]. Furthermore, alternative segmentations given those parameters are ignored, as are the segmentations under alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly by integrating out the emission and transition variables:
\[
P(q \mid y) = \int_A \int_\theta P(q, A, \theta \mid y) \, d\theta \, dA.
\]

Since this integral is intractable, it has to be approximated using Markov Chain Monte Carlo techniques, i.e. drawing $N$ samples
\[
(q^{(i)}, A^{(i)}, \theta^{(i)}) \sim P(q, A, \theta \mid y)
\]
and subsequently approximating the marginal state probabilities by their frequency in the sample:
\[
P(q_t = s \mid y) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(q_t^{(i)} = s\right).
\]
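In code, this Monte Carlo estimate is just a normalized count over the sampled state sequences; a minimal sketch (function and variable names are ours):

```python
def marginal_probs(samples, n_states):
    # samples: N sampled state sequences q^(i), each of length T.
    # Returns probs[t][s], the fraction of samples with q_t = s --
    # the Monte Carlo estimate of P(q_t = s | y).
    N, T = len(samples), len(samples[0])
    probs = [[0.0] * n_states for _ in range(T)]
    for q in samples:
        for t, s in enumerate(q):
            probs[t][s] += 1.0 / N
    return probs
```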

Thus, for each position $t$, we get a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals $(q \mid A, \theta, y)$, $(A \mid q, \theta, y)$ and $(\theta \mid q, A, y)$, conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as
\[
A \sim P(A \mid q) \propto P(q \mid A)\, P(A \mid \pi_A),
\]
\[
\theta \sim P(\theta \mid q, y) \propto P(q, y \mid \theta)\, P(\theta \mid \pi_\theta), \quad \text{and}
\]
\[
q \sim P(q \mid A, y, \theta) \propto P(A, y, \theta \mid q)\, P(q \mid \pi_q),
\]
where $\pi$ represents hyperparameters to the prior distributions. Notice that $q$ depends on a prior only in $q_0$, so $\pi_q$ is a set of parameters from which the distribution of $q_0$ is sampled (typically, $\pi_q$ are parameters of a Dirichlet distribution). In non-Bayesian settings, $\pi$ often denotes $P(q_0)$ itself. Typically, each prior will be conjugate, i.e. it will be in the same class of distributions as the posterior. Thus $\pi_q$ and $\pi_{A(k)}$, the hyperparameters of $q_0$ and the $k$-th row of $A$, will be the $\alpha_i$ of a Dirichlet distribution, and $\pi_\theta = (\alpha, \beta, \nu, \mu_0)$ will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]. Variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let $B = (B_w)_{w=1}^{W}$ be a partition of $y$ into $W$ blocks, where each block $B_w$ contains $n_w$ elements, and let $y_{w,k}$ be the $k$-th element in $B_w$.


The forward variable $\alpha_w(j)$ for this block needs to take into account the $n_w$ emissions, the transition into state $j$, and the $n_w - 1$ self-transitions, which yields
\[
\alpha_w(j) = A_{jj}^{n_w - 1}\, P(B_w \mid \mu_j, \sigma_j^2) \sum_i \alpha_{w-1}(i) A_{ij}, \quad \text{and}
\]
\[
P(B_w \mid \mu, \sigma^2) = \prod_{k=1}^{n_w} P(y_{w,k} \mid \mu, \sigma^2).
\]
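For Gaussian emissions, the block likelihood above depends on the data only through $n_w$, $\sum y_{w,k}$ and $\sum y_{w,k}^2$, which is what makes sufficient-statistic compression work. A sketch in log space (naming is ours), using the identity $\sum_k (y_k - \mu)^2 = \sum_k y_k^2 - 2\mu \sum_k y_k + n\mu^2$:

```python
import math

def log_block_likelihood(n, s1, s2, mu, var):
    # log prod_k N(y_wk | mu, var) for one block, using only the block's
    # sufficient statistics: n = n_w, s1 = sum of y_wk, s2 = sum of y_wk^2.
    # No individual observation is touched.
    return (-0.5 * n * math.log(2 * math.pi * var)
            - (s2 - 2 * mu * s1 + n * mu * mu) / (2 * var))
```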

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but would still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can easily be obtained.

Wavelet theory preliminaries

Here we review some essential wavelet theory; for details, see [74, 75]. Let
\[
\psi(x) =
\begin{cases}
1 & 0 \le x < \frac{1}{2}, \\
-1 & \frac{1}{2} \le x < 1, \\
0 & \text{elsewhere}
\end{cases}
\]
be the Haar wavelet [76], and $\psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k)$; $j$ and $k$ are called the scale and shift parameter. Any square-integrable function over the unit interval, $f \in L^2([0,1))$, can be approximated using the orthonormal basis $\{\psi_{j,k} \mid j, k \in \mathbb{Z},\ -1 \le j,\ 0 \le k \le 2^j - 1\}$, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function $f(x)$ using scaled and shifted copies of one simple basis function $\psi(x)$ which is spatially localized, i.e. non-zero on only a finite interval in $x$. The Haar basis is particularly suited for expressing piecewise constant functions.
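These definitions translate directly into code; a small sketch (function names are ours):

```python
def psi(x):
    # Haar mother wavelet on [0, 1).
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def psi_jk(j, k, x):
    # Scaled and shifted copy 2^(j/2) * psi(2^j x - k);
    # j is the scale parameter, k the shift.
    return 2.0 ** (j / 2.0) * psi(2.0 ** j * x - k)
```

For example, psi_jk(1, 1, ·) is supported on [1/2, 1) and takes the values ±√2 on its two halves.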

Finite data $y = (y_0, \dots, y_{T-1})$ can be treated as an equidistant sample of $f(x)$ by scaling the indices to the unit interval using $x_t = \frac{t}{T}$. Let $h = \log_2 T$. Then $y$ can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution ($j \le h - 1$):
\[
y_t = \sum_{j,k} d_{j,k}\, \psi_{j,k}(x_t).
\]
The wavelet transform $d = \mathcal{W} y$ is an orthogonal endomorphism and thus incurs neither redundancy nor loss of information. Notably, $d$ can be computed in linear time using the pyramid algorithm [77].
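A minimal pyramid-algorithm sketch for the orthonormal Haar transform. The coefficient layout is our own choice (overall scaling coefficient first, then detail levels from coarse to fine), and we assume the length of y is a power of two:

```python
import math

def haar_transform(y):
    # Repeatedly split into pairwise sums and differences, each scaled by
    # 1/sqrt(2). Total work is T + T/2 + T/4 + ... < 2T, i.e. linear time.
    a = list(y)
    details = []
    while len(a) > 1:
        s = [(a[2 * k] + a[2 * k + 1]) / math.sqrt(2) for k in range(len(a) // 2)]
        d = [(a[2 * k] - a[2 * k + 1]) / math.sqrt(2) for k in range(len(a) // 2)]
        details = d + details  # coarser levels end up in front
        a = s
    return a + details
```

Orthonormality means the transform preserves the signal's energy, which is easy to verify on a small example.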

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let $\tilde{d}$ be a vector obtained by setting elements of $d$ to zero; then $\tilde{y} = \mathcal{W}^\intercal \tilde{d}$ is called selective wavelet reconstruction (SW). If all coefficients $d_{j,k}$ with $\psi_{j,k}(x_t) \ne \psi_{j,k}(x_{t+1})$ are set to zero for some $t$, then $\tilde{y}_t = \tilde{y}_{t+1}$, which implies a block structure on $\tilde{y}$. Conversely, blocks of size $> 2$ (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that $\tilde{y}$ does not have to be computed explicitly; the block boundaries can be inferred from the positions of zero-entries in $\tilde{d}$ alone.

Assume all HMM states had the same emission variance $\sigma^2$. Since each state is associated with an emission mean, finding $q$ can be viewed as a regression or smoothing problem of finding an estimate $\hat{\mu}$ of a piecewise constant function $\mu$ whose range is precisely the set of emission means, i.e.
\[
\mu = f(x), \qquad y = f(x) + \varepsilon, \qquad \varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2).
\]
Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate $\hat{y} \ne \hat{\mu}$. However, if $\hat{y}$ is piecewise constant and minimizes $\lVert \mu - \hat{y} \rVert$, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding state and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in $\mu$ must be close to change points in $\hat{y}$, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints; $\hat{y}$ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could, in theory, be used as an almost ideal spatially adaptive regression method. Assuming one could provide an oracle $\Delta(\mu, y)$ that knows the true $\mu$, then there exists a method $M_{\mathrm{SW}}(y, \Delta)$ using an optimal subset of wavelet coefficients provided by $\Delta$, such that the quadratic risk of $\hat{y}_{\mathrm{SW}}$ is bounded as
\[
\lVert \mu - \hat{y}_{\mathrm{SW}} \rVert_2^2 = O\!\left( \frac{\sigma^2 \ln T}{T} \right).
\]
By definition, $M_{\mathrm{SW}}$ would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method $M_{\mathrm{WCT}}(y, \lambda\sigma)$, called wavelet coefficient thresholding, which sets to zero those coefficients whose absolute value is smaller than some threshold $\lambda\sigma$, there exists some $\lambda_T \le \sqrt{2 \ln T}$ such that
\[
\lVert \mu - \hat{y}_{\mathrm{WCT}} \rVert_2^2 \le (2 \ln T + 1) \left( \lVert \mu - \hat{y}_{\mathrm{SW}} \rVert_2^2 + \frac{\sigma^2}{T} \right).
\]
This $\lambda_T$ is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large $T$, on the order of tens to hundreds, the universal threshold $\lambda_T^u = \sqrt{2 \ln T}$ is asymptotically minimax. In other words, for data large enough


to warrant compression, universal thresholding is the best method to approximate $\mu$, and thus the best wavelet-based compression scheme for a given noise level $\sigma^2$.
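Putting the pieces together, the following self-contained sketch applies universal thresholding to a noisy step signal and reconstructs a piecewise constant estimate. Transform layout and names are ours, and the noise level σ is assumed known here:

```python
import math

def haar(y):
    # Forward orthonormal Haar transform (pyramid algorithm);
    # assumes len(y) is a power of two.
    a, details = list(y), []
    while len(a) > 1:
        s = [(a[2 * k] + a[2 * k + 1]) / math.sqrt(2) for k in range(len(a) // 2)]
        d = [(a[2 * k] - a[2 * k + 1]) / math.sqrt(2) for k in range(len(a) // 2)]
        a, details = s, d + details
    return a + details

def inverse_haar(c):
    # Invert level by level: a_(2k) = (s+d)/sqrt(2), a_(2k+1) = (s-d)/sqrt(2).
    a, pos = [c[0]], 1
    while pos < len(c):
        d = c[pos:pos + len(a)]
        pos += len(a)
        a = [v for s_, d_ in zip(a, d)
             for v in ((s_ + d_) / math.sqrt(2), (s_ - d_) / math.sqrt(2))]
    return a

def universal_threshold_denoise(y, sigma):
    # Zero every detail coefficient with |d| below lambda_u * sigma, where
    # lambda_u = sqrt(2 ln T); the scaling coefficient c[0] is kept.
    lam = math.sqrt(2 * math.log(len(y))) * sigma
    c = haar(y)
    c = [c[0]] + [v if abs(v) >= lam else 0.0 for v in c[1:]]
    return inverse_haar(c)
```

On an 8-point step signal with small perturbations, only two coefficients survive the threshold and the reconstruction collapses to two constant blocks, exactly the segment structure the HMM is then run on.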

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, $\hat{y}$ will approximately preserve the breakpoints around those low-variance segments. Segments of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides a priori samples in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

1:  procedure HaMMLET(y, π_A, π_θ, π_q)
2:    T ← |y|
3:    λ ← √(2 ln T)
4:    A ∼ P(A | π_A)
5:    θ ∼ P(θ | π_θ)
6:    for i = 1, …, N do
7:      σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
8:      Create block sequence B from threshold λσ_min
9:      q ∼ P(A, B, θ | q) P(q | π_q)
10:     A ∼ P(q | A) P(A | π_A)
11:     θ ∼ P(q, B | θ) P(θ | π_θ)
12:     Add count of marginal states for q to result
13:   end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,
\[
\hat{\sigma}^2_{\mathrm{MAD}} = \left( \frac{\operatorname{MAD}_k\, d_{\log_2 T - 1,\, k}}{\Phi^{-1}(3/4)} \right)^2,
\]
as an estimate of the variance in the dominant component, and modify the threshold definition to $\lambda \cdot \min\{\sigma_{\mathrm{MAD}}, \sigma_i \in \theta\}$. If the data is not i.i.d., $\hat{\sigma}^2_{\mathrm{MAD}}$ will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
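The MAD estimator above can be sketched as follows. We use the standard median-absolute-deviation form with Φ⁻¹(3/4) ≈ 0.6745, applied to the finest-level detail coefficients; names are ours:

```python
from statistics import median

PHI_INV_3_4 = 0.6744897501960817  # Phi^{-1}(3/4) for the standard normal

def sigma_mad(finest_details):
    # Robust scale estimate from the finest-level Haar coefficients: under
    # i.i.d. Gaussian noise these are (almost all) pure noise, so
    # MAD / Phi^{-1}(3/4) estimates sigma without being spoiled by the few
    # large coefficients that carry the actual breakpoints.
    m = median(finest_details)
    dev = [abs(d - m) for d in finest_details]
    return median(dev) / PHI_INV_3_4
```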

A data structure for dynamic compression

The necessity to recreate a new block sequence in each it-eration based on the most recent estimate of the smallestvariance parameter creates the challenge of doing so with lit-tle computational overhead specifically without repeatedlycomputing the inverse wavelet transform or considering allT elements in other ways We achieve this by creating asimple tree-based data structure

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log₂ T and ℓ = h − j. We can map the wavelet ψ_{j,k} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ is increasing from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position 2^j + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{j,k} = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{j,k}, with ℓ = h − j and t = k2^ℓ

(ψ_{−1,0} is simply constant on the [0, 1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size:
\[ N_{\ell,t}[n] = 2^\ell \]

• The sum of data points stored in the subtree's leaves:
\[ N_{\ell,t}[\Sigma_1] = \sum_{i \in I_{j,k}} y_i \]

• Similarly, the sum of squares:
\[ N_{\ell,t}[\Sigma_2] = \sum_{i \in I_{j,k}} y_i^2 \]

• The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself:
\[ N_{0,t}[d] = 0, \qquad N_{\ell>0,t}[d] = \max_{\substack{\ell' \le \ell \\ t \le t' < t + 2^\ell}} \left| d_{h-\ell',\, \lfloor t'/2^{\ell'} \rfloor} \right| \]

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

bioRxiv preprint, doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
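The DFS traversal with the i + 2n − 1 jump can be sketched as follows. This is a simplified Python toy, not the actual implementation; here each DFS array entry stores the block size, the sum of the subtree's leaves, and the maximum absolute Haar coefficient of the subtree:

```python
import math

def build_dfs(y):
    """Build a DFS-ordered flat array for a perfect binary tree over y.

    Each entry is (n, leaf_sum, max_abs_coeff); a node's subtree occupies
    the next 2n - 1 array positions, so pruning skips i -> i + 2n - 1.
    len(y) must be a power of two.
    """
    def rec(seg):
        n = len(seg)
        if n == 1:
            return [(1, seg[0], 0.0)]                    # leaves carry no coefficient
        left, right = rec(seg[: n // 2]), rec(seg[n // 2:])
        s = left[0][1] + right[0][1]
        d = (left[0][1] - right[0][1]) / math.sqrt(n)    # Haar detail coefficient
        m = max(abs(d), left[0][2], right[0][2])         # subtree maximum
        return [(n, s, m)] + left + right
    return rec(list(y))

def blocks(nodes, threshold):
    """Linear scan in DFS order: prune a subtree once its maximum
    coefficient is at most the threshold, emitting one block of size n."""
    out, i = [], 0
    while i < len(nodes):
        n, _, m = nodes[i]
        if m <= threshold:
            out.append(n)
            i += 2 * n - 1   # skip the entire subtree
        else:
            i += 1           # descend into the children
    return out

nodes = build_dfs([0, 0, 0, 0, 5, 5, 5, 5])
sizes = blocks(nodes, 0.5)   # the single breakpoint at t = 4 yields two blocks
```

With a high threshold the root itself is pruned and the whole sequence collapses into a single block, e.g. `blocks(nodes, 10)` returns `[8]`.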

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.
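Block structure prediction amounts to caching, per block count, the range of variances known to produce that structure. A minimal sketch under our own naming (not the actual implementation):

```python
def make_predictor():
    """Remember, for each block count, the [lo, hi] variance range seen so far.

    reuse(var, prev_count) is True when the new variance falls inside the
    range already registered for the previous iteration's block count; by
    the bijection between block count and block structure, the previous
    block sequence can then be reused without any tree traversal.
    """
    ranges = {}

    def reuse(var, prev_count):
        lo, hi = ranges.get(prev_count, (None, None))
        return lo is not None and lo <= var <= hi

    def register(var, count):
        lo, hi = ranges.get(count, (var, var))
        ranges[count] = (min(lo, var), max(hi, var))

    return reuse, register

reuse, register = make_predictor()
register(1.0, 19)     # variance 1.0 produced 19 blocks
register(1.2, 19)     # so did 1.2
hit = reuse(1.1, 19)  # 1.1 lies in [1.0, 1.2] -> reuse the block sequence
miss = reuse(0.5, 19) # outside the known range -> recompute via traversal
```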

The wavelet tree data structure can be readily extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees was above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists, or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus, the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T. (Axes: starting position of block (t), block size (n), level (ℓ); legend: DFS array (i+1), wavelet tree, pruning jumps (i+2n−1).)

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler. (Axes: position along chromosome, measurement, block size.)


following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ∼ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:

\[ \sigma^2 \sim \mathrm{I}\Gamma(\alpha, \beta), \qquad \mu \sim \mathcal{N}\!\left(\mu_0, \frac{\sigma^2}{\nu}\right) \]

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and p the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain

\[ p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right) \]

IΓ has a mean for α > 1, and closed-form solutions for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

\[ \beta = -s^2 \left( W_{-1}\!\left(-\frac{p}{e}\right) + 1 \right), \qquad 0 < p \le 1, \]

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation, with a maximum error of 0.025%, is given in [87], which yields

\[ \beta \approx s^2 \left( \frac{2\sqrt{b}}{M_1 \sqrt{b} + \sqrt{2}\left( M_2\, b\, e^{M_3 \sqrt{b}} + 1 \right)} + b \right), \qquad b = -\ln p, \]
\[ M_1 = 0.3361, \quad M_2 = -0.0042, \quad M_3 = -0.0201. \]
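A small numerical sanity check of this hyperparameter choice can be written in Python (our own sketch; for α = 2 the regularized upper incomplete gamma function has the closed form Q(2, x) = (1 + x)e^{−x}, so the approximated β can be verified against the target probability p):

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201

def beta_for(p, s2=1.0):
    """Approximate beta such that P(sigma^2 <= s2) = p under IG(alpha=2, beta),
    using the Lambert-W approximation of Barry et al. [87]."""
    b = -math.log(p)
    return s2 * (2 * math.sqrt(b)
                 / (M1 * math.sqrt(b)
                    + math.sqrt(2) * (M2 * b * math.exp(M3 * math.sqrt(b)) + 1))
                 + b)

def q2(x):
    """Q(2, x) = (1 + x) * exp(-x): CDF complement of IG at beta/s2 = x."""
    return (1 + x) * math.exp(-x)

p = 0.9
beta = beta_for(p)   # beta such that a variance <= s2 = 1 has probability ~0.9
check = q2(beta)     # round-trip through the exact CDF; close to p
```

The round trip recovers p to well within the 0.025% error bound quoted for the W₋₁ approximation.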

Since the mean of IΓ is β/(α−1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that y contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

\[ \mu_0 = \frac{\Sigma_1}{n} \quad \text{and} \quad \nu = \beta \left( \max\left\{ \frac{n\Sigma_2 - \Sigma_1^2}{n^2},\, \delta \right\} \right)^{-1}. \]

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

\[ \frac{A_{jj}^{n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right) \]

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

\[ \exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1) \log A_{jj} - n_w \log \sigma_j \right) \]

Using the block's sufficient statistics

\[ n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2, \]

the exponent can be rewritten as

\[ E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j), \]
\[ K(n_w, j) = (n_w - 1) \log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right). \]

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

\[ \exp\left( \max_k E_w(k) \right), \]

which cancels out during normalization. Hence we obtain

\[ \tilde{\alpha}_w(j) = \exp\left( E_w(j) - \max_k E_w(k) \right) \sum_i \alpha_{w-1}(i)\, A_{ij}, \]

which are then normalized to

\[ \alpha_w(j) = \frac{\tilde{\alpha}_w(j)}{\sum_k \tilde{\alpha}_w(k)}. \]
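The rescaled block-forward step above can be sketched for a single block as follows (a simplified Python illustration with our own variable names, not the actual C++ implementation):

```python
import math

def forward_block(alpha_prev, A, mu, sigma, n_w, S1, S2):
    """One forward step over a block with sufficient statistics (n_w, S1, S2).

    Computes E_w(j) including the precomputable part K(n_w, j), subtracts
    the maximum exponent for numerical stability, and normalizes.
    """
    S = len(mu)  # number of states
    E = []
    for j in range(S):
        K = (n_w - 1) * math.log(A[j][j]) \
            - n_w * (math.log(sigma[j]) + mu[j] ** 2 / (2 * sigma[j] ** 2))
        E.append((2 * mu[j] * S1 - S2) / (2 * sigma[j] ** 2) + K)
    m = max(E)  # shift so that all exponents are <= 0
    raw = [math.exp(E[j] - m) * sum(alpha_prev[i] * A[i][j] for i in range(S))
           for j in range(S)]
    z = sum(raw)
    return [r / z for r in raw]

# A block of five observations all equal to 0 strongly favors state 0 (mean 0):
alpha = forward_block([0.5, 0.5],
                      [[0.9, 0.1], [0.1, 0.9]],
                      mu=[0.0, 1.0], sigma=[1.0, 1.0],
                      n_w=5, S1=0.0, S2=0.0)
```

In the real sampler, K(n_w, j) would be cached per (block size, state) pair, so each block costs only one `exp` per state.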

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland WS: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9 Suppl 2, S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, 1–1 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian hidden Markov modeling of array CGH data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for hidden Markov models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), 6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text categorization with class-based and corpus-based keyword selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009)

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of subtidal coastal sea level fluctuations using wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical properties and uses of the wavelet variance estimator for the scale analysis of time series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5



more computationally efficient. We believe this makes it an excellent choice both for routine use in clinical settings and principled re-analysis of large cohorts, where the added accuracy and the much improved information about uncertainty in copy number calls from posterior marginal distributions will likely yield improved insights into CNV as a source of genetic variation and its relationship to disease.

Methods

We will briefly review Forward-Backward Gibbs sampling (FBG) for Bayesian Hidden Markov Models, and its acceleration through compression of the data into blocks. By first considering the case of equal emission variances among all states, we show that optimal compression is equivalent to a concept called selective wavelet reconstruction, following a classic proof in wavelet theory. We then argue that wavelet coefficient thresholding, a variance-dependent minimax estimator, allows for compression even in the case of unequal emission variances. This allows the compression of the data to be adapted to the prior variance level at each sampling iteration. We then derive a simple data structure to dynamically create blocks with little overhead. While wavelet approaches have been used for aCGH data before [25, 29, 30, 63], our method provides the first combination of wavelets and HMMs.

Bayesian Hidden Markov Models

Let T be the length of the observation sequence, which is equal to the number of probes. An HMM can be represented as a statistical model (q, A, θ | y) with transition matrix A, a latent state sequence q = (q_0, q_1, …, q_{T−1}), an observed emission sequence y = (y_0, y_1, …, y_{T−1}), and the emission parameters θ. The initial distribution of q_0 is often included and denoted as π, but in a Bayesian setting it makes more sense to treat π as a set of hyperparameters for q_0.

In the usual frequentist approach, the state sequence q is inferred by first finding a maximum likelihood estimate of the parameters,

    (A_ML, θ_ML) = argmax_{(A, θ)} L(A, θ | y),

using the Baum-Welch algorithm [53, 54]. This is only guaranteed to yield local optima, as the likelihood function is not convex. Repeated random reinitializations are used to find "good" local optima, but there are no guarantees for this method. Then the most likely state sequence given those parameters,

    q* = argmax_q P(q | A_ML, θ_ML, y),

is calculated using the Viterbi algorithm [55]. However, if there are only a few aberrations, that is, there is imbalance between classes, the ML parameters tend to overfit the normal state, which is likely to yield incorrect segmentations [62]. Furthermore, alternative segmentations given those parameters are ignored, as are those for alternative parameters.

The Bayesian approach is to calculate the distribution of state sequences directly, by integrating out the emission and transition variables:

    P(q | y) = ∫_A ∫_θ P(q, A, θ | y) dθ dA

Since this integral is intractable, it has to be approximated using Markov Chain Monte Carlo techniques, i.e. drawing N samples

    (q^(i), A^(i), θ^(i)) ~ P(q, A, θ | y)

and subsequently approximating marginal state probabilities by their frequency in the sample:

    P(q_t = s | y) ≈ (1/N) Σ_{i=1}^{N} I(q_t^(i) = s)

Thus, for each position t, we obtain a complete probability distribution over the possible states. As the marginals of each variable are explicitly defined by conditioning on the other variables, an HMM lends itself to Gibbs sampling, i.e. repeatedly sampling from the marginals (q | A, θ, y), (A | q, θ, y) and (θ | q, A, y), conditioned on the previously sampled values. Using Bayes' formula and several conditional independence relations, the sampling process can be written as

    A ~ P(A | q) ∝ P(q | A) P(A | π_A),
    θ ~ P(θ | q, y) ∝ P(q, y | θ) P(θ | π_θ), and
    q ~ P(q | A, y, θ) ∝ P(A, y, θ | q) P(q | π_q),

where π represents the hyperparameters of the prior distributions. Notice that q depends on a prior only through q_0, so π_q is a set of parameters from which the distribution of q_0 is sampled (typically, π_q are parameters of a Dirichlet distribution); in non-Bayesian settings, π often denotes P(q_0) itself. Typically, each prior will be conjugate, i.e. it will be in the same class of distributions as the posterior. Thus π_q and π_A(k), the hyperparameters of q_0 and of the k-th row of A, will be the α_i of a Dirichlet distribution, and π_θ = (α, β, ν, μ_0) will be the parameters of a Normal-Inverse Gamma distribution.

Though there are several sampling schemes available, [58] has argued strongly in favor of Forward-Backward Gibbs sampling (FBG) [57]; variations of this have been implemented for segmentation of aCGH data before [60, 62, 73]. However, sampling-based Bayesian methods such as FBG are computationally expensive. Recently, [62] have introduced compressed FBG by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let B = (B_w), w = 1, …, W, be a partition of y into W blocks, where each block B_w contains n_w elements, and let y_{w,k} denote the k-th element in B_w.


The forward variable α_w(j) for this block needs to take into account the n_w emissions, the transition into state j, and the n_w − 1 self-transitions, which yields

    α_w(j) = A_jj^{n_w − 1} P(B_w | μ_j, σ_j²) Σ_i α_{w−1}(i) A_ij,  with

    P(B_w | μ, σ²) = Π_{k=1}^{n_w} P(y_{w,k} | μ, σ²).

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can be easily obtained.
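To make the block-wise recursion concrete, here is a minimal sketch in Python of a forward pass over blocks rather than single observations. It assumes Gaussian emissions and evaluates each per-block likelihood in log space for stability; all names are ours, not the paper's implementation.

```python
import numpy as np

def forward_blocks(blocks, A, mu, sigma, pi0):
    """Normalized forward pass over a block partition B_1..B_W.

    blocks: list of 1-D arrays of observations; A: KxK transition matrix;
    mu, sigma: per-state emission parameters; pi0: initial distribution.
    Returns the normalized forward variables of the last block."""
    K = len(mu)
    alpha = None
    for B in blocks:
        n = len(B)
        # log P(B_w | mu_j, sigma_j^2): product over the block's emissions
        log_em = np.array([
            -0.5 * np.sum((B - mu[j]) ** 2) / sigma[j] ** 2
            - n * np.log(sigma[j]) - 0.5 * n * np.log(2 * np.pi)
            for j in range(K)])
        # transition into state j, plus n_w - 1 self-transitions
        into = pi0 if alpha is None else alpha @ A
        log_a = np.log(into) + (n - 1) * np.log(np.diag(A)) + log_em
        alpha = np.exp(log_a - log_a.max())   # rescale to avoid underflow
        alpha /= alpha.sum()                  # normalize
    return alpha
```

With two well-separated states, the posterior mass of the final block concentrates on the state matching its mean, as expected.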

Wavelet theory preliminaries

Here, we review some essential wavelet theory; for details, see [74, 75]. Let

    ψ(x) =  1  for 0 ≤ x < 1/2,
           −1  for 1/2 ≤ x < 1,
            0  elsewhere

be the Haar wavelet [76], and ψ_jk(x) := 2^{j/2} ψ(2^j x − k); j and k are called the scale and shift parameter. Any square-integrable function over the unit interval, f ∈ L²([0,1)), can be approximated using the orthonormal basis {ψ_jk | j, k ∈ ℤ, −1 ≤ j, 0 ≤ k ≤ 2^j − 1}, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function f(x) using scaled and shifted copies of one simple basis function ψ(x) which is spatially localized, i.e. non-zero on only a finite interval in x. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data y = (y_0, …, y_{T−1}) can be treated as an equidistant sample of f(x) by scaling the indices to the unit interval using x_t := t/T. Let h := log₂ T. Then y can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution (j ≤ h − 1):

    y_t = Σ_{j,k} d_jk ψ_jk(x_t)

The wavelet transform d = W y is an orthogonal endomorphism and thus incurs neither redundancy nor loss of information. Surprisingly, d can be computed in linear time using the pyramid algorithm [77].
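For concreteness, a sketch of the pyramid algorithm for the orthonormal Haar transform and its inverse (Python; normalization conventions differ between texts, and the function names are ours):

```python
import numpy as np

def haar_pyramid(y):
    """Forward Haar transform of y (length a power of 2) in O(T) time.
    Returns the overall scaling coefficient and the detail coefficients,
    finest level first."""
    s = np.asarray(y, dtype=float)
    details = []
    while len(s) > 1:
        details.append((s[0::2] - s[1::2]) / np.sqrt(2.0))  # details d_jk
        s = (s[0::2] + s[1::2]) / np.sqrt(2.0)              # smooth part
    return s[0], details

def inverse_haar(c0, details):
    """Inverse transform; exact reconstruction up to rounding error."""
    s = np.array([c0])
    for d in reversed(details):
        out = np.empty(2 * len(s))
        out[0::2] = (s + d) / np.sqrt(2.0)
        out[1::2] = (s - d) / np.sqrt(2.0)
        s = out
    return s
```

On a piecewise constant signal, detail coefficients are non-zero only where a wavelet's support crosses a breakpoint, which is what makes the Haar basis attractive here.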

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let d̃ be a vector obtained by setting elements of d to zero; then ỹ = W^T d̃ is called selective wavelet reconstruction (SW). If, for some t, all coefficients d_jk with ψ_jk(x_t) ≠ ψ_jk(x_{t+1}) are set to zero, then ỹ_t = ỹ_{t+1}, which implies a block structure on ỹ. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that ỹ does not have to be computed explicitly; the block boundaries can be inferred from the positions of zero-entries in d̃ alone.

Assume all HMM states had the same emission variance σ². Since each state is associated with an emission mean, finding q can be viewed as a regression or smoothing problem of finding an estimate μ̂ of a piecewise constant function μ whose range is precisely the set of emission means, i.e.

    μ = f(x),   y = f(x) + ε,   ε ~iid N(0, σ²).

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate ỹ ≠ μ̂. However, if ỹ is piecewise constant and minimizes ‖μ − ỹ‖, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding states, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in μ must be close to change points in ỹ, since moving block boundaries incurs additional loss, allowing for approximate recovery of the true breakpoints. ỹ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal spatially adaptive regression method. Assuming one could provide an oracle Δ(μ, y) that knows the true μ, there exists a method M_SW(y, Δ) using an optimal subset of wavelet coefficients provided by Δ, such that the quadratic risk of the reconstruction ỹ_SW is bounded as

    ‖μ − ỹ_SW‖₂² = O(σ² ln T / T).

By definition, M_SW would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method M_WCT(y, λσ), called wavelet coefficient thresholding, which sets to zero those coefficients whose absolute value is smaller than some threshold λσ, there exists some λ_T ≤ √(2 ln T), with ỹ_WCT = M_WCT(y, λ_T σ), such that

    ‖μ − ỹ_WCT‖₂² ≤ (2 ln T + 1) (‖μ − ỹ_SW‖₂² + σ²/T).

This λ_T is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large T, on the order of tens to hundreds, the universal threshold λ_T^u = √(2 ln T) is asymptotically minimax. In other words, for data large enough


to warrant compression, universal thresholding is the best method to approximate μ, and thus the best wavelet-based compression scheme for a given noise level σ².
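As an illustration, universal thresholding as a compression step might be sketched as follows (Python; assumes T is a power of two and hard-thresholds the Haar detail coefficients at λ = σ√(2 ln T); names are ours):

```python
import numpy as np

def haar_threshold(y, sigma):
    """Reconstruction after zeroing all Haar detail coefficients with
    absolute value <= sigma * sqrt(2 ln T); runs of equal values in the
    result correspond to blocks."""
    T = len(y)
    lam = sigma * np.sqrt(2.0 * np.log(T))
    s = np.asarray(y, dtype=float)
    details = []
    while len(s) > 1:                         # forward pyramid
        details.append((s[0::2] - s[1::2]) / np.sqrt(2.0))
        s = (s[0::2] + s[1::2]) / np.sqrt(2.0)
    # hard thresholding: keep only coefficients above lambda
    details = [np.where(np.abs(d) > lam, d, 0.0) for d in details]
    for d in reversed(details):               # inverse pyramid
        out = np.empty(2 * len(s))
        out[0::2] = (s + d) / np.sqrt(2.0)
        out[1::2] = (s - d) / np.sqrt(2.0)
        s = out
    return s
```

A noiseless step function survives thresholding unchanged, since its only non-zero detail coefficient (at the breakpoint) lies far above the threshold.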

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, ỹ will approximately preserve the breakpoints around those low-variance segments. Segments of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known, FBG provides samples of them in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1 Dynamically adaptive FBG for HMMs

 1: procedure HAMMLET(y, π_A, π_θ, π_q)
 2:   T ← |y|
 3:   λ ← √(2 ln T)
 4:   A ~ P(A | π_A)
 5:   θ ~ P(θ | π_θ)
 6:   for i = 1, …, N do
 7:     σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:     Create block sequence B from threshold λσ_min
 9:     q ~ P(A, B, θ | q) P(q | π_q)
10:     A ~ P(q | A) P(A | π_A)
11:     θ ~ P(q, B | θ) P(θ | π_θ)
12:     Add count of marginal states for q to result
13:   end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

    σ_MAD² = ( MAD_k{ d_{log₂(T)−1, k} } / Φ⁻¹(3/4) )²,

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i ∈ θ}. If the data are not i.i.d., σ_MAD² will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
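A sketch of this estimator (Python; 0.6745 ≈ Φ⁻¹(3/4), and the finest-level Haar coefficients are simply scaled differences of neighboring observations; names are ours):

```python
import numpy as np

def sigma_mad(y):
    """Noise standard deviation estimated from the median absolute
    deviation of the finest-level Haar detail coefficients."""
    arr = np.asarray(y, dtype=float)
    d = (arr[0::2] - arr[1::2]) / np.sqrt(2.0)   # d_{log2(T)-1, k}
    mad = np.median(np.abs(d - np.median(d)))
    return mad / 0.6745                          # Phi^{-1}(3/4) rescaling
```

Because it works on pairwise differences, the estimate is largely insensitive to the piecewise constant mean structure, and robust against outliers.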

A data structure for dynamic compression

The necessity to create a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h := log₂ T and ℓ := h − j. We can map the wavelets ψ_jk to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ increases from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_jk being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_jk = [kn, (k+1)n − 1], stored in the leaves below; notice that for the leaves, kn = t.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in that subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_jk, with ℓ = h − j and t = k·2^ℓ (ψ_{−1,0} is simply constant on the [0,1) interval and has no effect on block creation, so we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block obtained when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size:

    N_{ℓ,t}[n] = 2^ℓ

• The sum of the data points stored in the subtree leaves:

    N_{ℓ,t}[Σ₁] = Σ_{i ∈ I_jk} y_i

• Similarly, the sum of squares:

    N_{ℓ,t}[Σ₂] = Σ_{i ∈ I_jk} y_i²

• The maximum absolute wavelet coefficient of the subtree, including the current d_jk itself:

    N_{0,t}[d] = 0,   N_{ℓ>0,t}[d] = max_{ℓ′ ≤ ℓ, t ≤ t′ < t+2^ℓ} | d_{h−ℓ′, ⌊t′/2^{ℓ′}⌋} |

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients


on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for an outlier is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is a perfect and complete binary tree, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
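A compact sketch of the idea (Python). For simplicity, the tree here is built recursively in DFS order directly, rather than by sorting a BFS array, but the pruning traversal with the i + 2n − 1 jump is the same; all names are ours.

```python
import numpy as np

def build_tree(y):
    """Flat DFS array of nodes (n, sum, sum_of_squares, max_abs_coeff)
    over y, whose length must be a power of 2."""
    nodes = []
    def rec(lo, hi):                    # subtree over y[lo:hi]
        i = len(nodes)
        nodes.append(None)              # reserve slot: node precedes children
        if hi - lo == 1:
            s1, s2, d = y[lo], y[lo] ** 2, 0.0   # leaves carry no coefficient
        else:
            mid = (lo + hi) // 2
            l1, l2, ld = rec(lo, mid)
            r1, r2, rd = rec(mid, hi)
            coeff = (l1 - r1) / np.sqrt(hi - lo)   # this node's Haar coefficient
            s1, s2 = l1 + r1, l2 + r2
            d = max(abs(coeff), ld, rd)            # subtree maximum |d_jk|
        nodes[i] = (hi - lo, s1, s2, d)
        return s1, s2, d
    rec(0, len(y))
    return nodes

def make_blocks(nodes, threshold):
    """Linear scan over the DFS array: prune a subtree into one block as
    soon as its maximum coefficient drops to the threshold or below."""
    blocks, i = [], 0
    while i < len(nodes):
        n, s1, s2, d = nodes[i]
        if d <= threshold:
            blocks.append((n, s1 / n))   # block size and mean from Sigma_1
            i += 2 * n - 1               # skip all 2n - 1 nodes of the subtree
        else:
            i += 1                        # descend into the children
    return blocks
```

On a clean two-segment signal, the traversal visits only the root and the two subtree roots, producing exactly two blocks.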

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check whether the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.

The wavelet tree data structure can readily be extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not into the queries during the sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists, or when large variation is expected, such as the lab and batch effects commonly observed in microarrays [86], as well as unknown means due to sample contamination. Since FBG requires a prior even in that case, we propose the

Figure 7: Mapping of wavelets ψ_jk and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values, thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


following method to specify the hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ~ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:

    σ² ~ IΓ(α, β),   μ ~ N(μ₀, σ²/ν)

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ_MAD²), and p the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain

    p = P(σ² ≤ s²) = Γ(α, β/s²) / Γ(α) = Q(α, β/s²)

IΓ has a mean for α > 1 and closed-form solutions for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

    β = −s² (W₋₁(−p/e) + 1),   0 < p ≤ 1,

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation, with a maximum error of 0.025%, is given in [87], which yields

    β ≈ s² ( b + √(2b) / (1 + M₁ √(b/2) + M₂ b exp(M₃ √b)) ),

    b = −ln p,   M₁ = 0.3361,   M₂ = −0.0042,   M₃ = −0.0201.

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that μ contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of the block means in the compression obtained from σ_MAD², and take the maximum of those two variances. We thus obtain

    μ₀ = Σ₁/n   and   ν = β ( max{ (nΣ₂ − Σ₁²)/n², δ } )⁻¹,

where n, Σ₁ and Σ₂ are taken from the root of the wavelet tree.
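The β computation above can be sketched as follows (Python; the W₋₁ approximation is the one from Barry et al. [87], algebraically rearranged for this special case, so treat the exact layout and all names as our own):

```python
import numpy as np

def beta_from_p(s2, p):
    """Scale parameter beta of the IG(alpha=2, beta) prior such that
    P(sigma^2 <= s2) = p, via the Barry et al. approximation of W_{-1}."""
    M1, M2, M3 = 0.3361, -0.0042, -0.0201
    b = -np.log(p)
    # W_{-1}(-p/e) ~ -1 - b - sqrt(2b) / (1 + M1 sqrt(b/2) + M2 b e^{M3 sqrt(b)})
    w = -1.0 - b - np.sqrt(2.0 * b) / (
        1.0 + M1 * np.sqrt(b / 2.0) + M2 * b * np.exp(M3 * np.sqrt(b)))
    return -s2 * (w + 1.0)          # beta = -s^2 (W_{-1}(-p/e) + 1)

# Sanity check: for alpha = 2 the CDF is P(sigma^2 <= s2) = (1 + x) e^{-x}
# with x = beta / s2, so plugging beta back in should recover p.
x = beta_from_p(1.0, 0.9)
recovered_p = (1.0 + x) * np.exp(-x)
```

Round-tripping through the CDF recovers p to within the approximation error of W₋₁, which is more than adequate for a weak prior.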

Numerical issues

To ensure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for the emissions and self-transitions within a block can be written as

    ( A_jj^{n_w − 1} / ( (2π)^{n_w/2} σ_j^{n_w} ) ) exp( − Σ_{k=1}^{n_w} (y_{w,k} − μ_j)² / (2σ_j²) ).

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

    exp( − Σ_{k=1}^{n_w} (y_{w,k} − μ_j)² / (2σ_j²) + (n_w − 1) log A_jj − n_w log σ_j )

Using the block's sufficient statistics

    n_w,   Σ₁ = Σ_{k=1}^{n_w} y_{w,k},   Σ₂ = Σ_{k=1}^{n_w} y_{w,k}²,

the exponent can be rewritten as

    E_w(j) = (2μ_j Σ₁ − Σ₂) / (2σ_j²) + K(n_w, j),

    K(n_w, j) = (n_w − 1) log A_jj − n_w ( log σ_j + μ_j² / (2σ_j²) ).

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, so that E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

    exp( max_k E_w(k) ),

which cancels out during normalization. Hence, we obtain

    α̂_w(j) = exp( E_w(j) − max_k E_w(k) ) Σ_i α_{w−1}(i) A_ij,

which are then normalized to

    α_w(j) = α̂_w(j) / Σ_k α̂_w(k).
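Putting these pieces together, one forward step over a block in this scheme might look as follows (Python; K(n_w, j) is computed inline here but would be precomputed once per iteration, as noted above; names are ours):

```python
import numpy as np

def block_forward_step(alpha_prev, A, mu, sigma, n_w, S1, S2):
    """One normalized forward step over a block, given its sufficient
    statistics n_w, S1 = sum(y), S2 = sum(y^2); no per-point loop needed."""
    # K(n_w, j): depends only on n_w and the state parameters
    K = (n_w - 1) * np.log(np.diag(A)) \
        - n_w * (np.log(sigma) + mu ** 2 / (2.0 * sigma ** 2))
    E = (2.0 * mu * S1 - S2) / (2.0 * sigma ** 2) + K
    a = np.exp(E - E.max()) * (alpha_prev @ A)   # max-subtraction avoids overflow
    return a / a.sum()                            # normalize
```

A block of ten observations equal to 5 pushes essentially all forward mass onto the state with mean 5, as expected.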

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W., Lee, C.: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk, L., Marshall, C.R., Wintle, R.F., Scherer, S.W.: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll, S.A., Altshuler, D.M.: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain, L.V., Armour, J.A.L., Tobin, M.D.: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann, J.S., Sharp, A.J., Antonarakis, S.E.: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook, E.H., Scherer, S.W.: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic, A., Tomljanovic, D.: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas, A.K., Corvin, A.P., Gallagher, L.: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho, E.K., Tchinda, J., Freeman, J.L., Chung, Y.-J., Cai, W.W., Lee, C.: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien, A., Malkin, D.: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk, L., Carson, A.R., Scherer, S.W.: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann, J.S., Estivill, X., Antonarakis, S.E.: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp, A.J.: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis, C., Coe, B.P., Zhang, L., Rosin, M.P., Lam, W.L.: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson, D.G., Ylstra, B., Segraves, R., Collins, C., Dairkee, S.H., Kowbel, D., Kuo, W.L., Gray, J.W., Pinkel, D.: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman, J.A., Fridlyand, J., Pejavar, S., Olshen, A.B., Korkola, J.E., DeVries, S., Carroll, P., Kuo, W.-L., Pinkel, D., Albertson, D.G., Cordon-Cardo, C., Jain, A.N., Waldman, F.M.: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio, R., Hautaniemi, S., Kauraniemi, P., Yli-Harja, O., Astola, J., Wolf, M., Kallioniemi, A.: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack, J.R., Sørlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lonning, P.E., Tibshirani, R., Botstein, D., Børresen-Dale, A.-L., Brown, P.O.: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers, P.H.C., de Menezes, R.X.: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland, W.S.: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti, B., Braude, I., Marrano, P., Thorner, P., Zielenska, M., Squire, J.A.: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl, J., Spokoiny, V.G.: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé, P., Stransky, N., Thiery, J.-P., Radvanyi, F., Barillot, E.: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang, Y., Wang, S.: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu, L., Self, S.G., Grove, D., Randolph, T., Wang, K., Delrow, J.J., Loo, L., Porter, P.: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang, H., Nguyen, N., Oraintara, S., Vo, A.: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), 17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt, C., Losic, B., Pai, D., Zhao, Z., Trinh, Q., Syam, S., Arshadi, N., Jang, G.H., Ali, J., Beck, T., McPherson, J., Muthuswamy, L.B.: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price, T.S., Regan, R., Mott, R., Hedman, A., Honey, B., Daniels, R.J., Smith, L., Greenfield, A., Tiganescu, A., Buckle, V., Ventress, N., Ayyub, H., Salhan, A., Pedraza-Diaz, S., Broxholme, J., Ragoussis, J., Higgs, D.R., Flint, J., Knight, S.J.L.: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis, C.E., Peng, R., Tsiarli, M.A., Miller, G.L., Schwartz, R.: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16 (2011). doi:10.1145/1963190.2063517

[32] Olshen, A.B., Venkatraman, E.S.: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman, E.S., Olshen, A.B.: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard, F., Robin, S., Lavielle, M., Vaisse, C., Daudin, J.-J.: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers, C.L., Dunham, M.J., Kung, S.Y., Troyanskaya, O.G.: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang, P., Kim, Y., Pollack, J., Narasimhan, B., Tibshirani, R.: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing, B., Greenwood, C.M.T., Bull, S.B.: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen, C.-H., Lee, H.-C.C., Ling, Q., Chen, H.-R., Ko, Y.-A., Tsou, T.-S., Wang, S.-C., Wu, L.-C., Lee, H.-C.C.: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin/Heidelberg (2003). doi:10.1007/3-540-36605-9. http://link.springer.com/10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains (1966). doi:10.1214/aoms/1177699147. http://www.jstor.org/stable/2238772. Accessed 2015-02-14

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), 6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104. http://epubs.siam.org/doi/book/10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5


The forward variable $\alpha_w(j)$ for this block needs to take into account the $n_w$ emissions, the transitions into state $j$, and the $n_w - 1$ self-transitions, which yields

$$\alpha_w(j) = A_{jj}^{\,n_w - 1}\, P(B_w \mid \mu_j, \sigma_j^2) \sum_i \alpha_{w-1}(i)\, A_{ij}, \qquad \text{and} \qquad P(B_w \mid \mu, \sigma^2) = \prod_{k=1}^{n_w} P(y_{w,k} \mid \mu, \sigma^2).$$

The ideal block structure would correspond to the actual, unknown segmentation of the data. Any subdivision thereof would decrease the compression ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such a segmentation would yield sufficient statistics for the likelihood computation that correspond to the true parameters of the state generating a segment. Using wavelet theory, we show that such a block structure can be easily obtained.

Wavelet theory preliminaries

Here we review some essential wavelet theory; for details, see [74, 75]. Let

$$\psi(x) = \begin{cases} 1 & 0 \le x < \frac{1}{2}, \\ -1 & \frac{1}{2} \le x < 1, \\ 0 & \text{elsewhere} \end{cases}$$

be the Haar wavelet [76], and let $\psi_{jk}(x) := 2^{j/2}\,\psi(2^j x - k)$; $j$ and $k$ are called the scale and shift parameter. Any square-integrable function over the unit interval, $f \in L^2([0,1))$, can be approximated using the orthonormal basis $\{\psi_{jk} \mid j, k \in \mathbb{Z};\ -1 \le j;\ 0 \le k \le 2^j - 1\}$, admitting a multiresolution analysis [77, 78]. Essentially, this allows us to express a function $f(x)$ using scaled and shifted copies of one simple basis function $\psi(x)$ which is spatially localized, i.e. non-zero on only a finite interval in $x$. The Haar basis is particularly suited for expressing piecewise constant functions.

Finite data $y = (y_0, \dots, y_{T-1})$ can be treated as an equidistant sample of $f(x)$ by scaling the indices to the unit interval using $x_t = \frac{t}{T}$. Let $h = \log_2 T$. Then $y$ can be expressed exactly as a linear combination over the Haar wavelet basis above, restricted to the maximum level of sampling resolution ($j \le h - 1$):

$$y_t = \sum_{j,k} d_{jk}\, \psi_{jk}(x_t).$$

The wavelet transform $d = \mathcal{W} y$ is an orthogonal endomorphism, and thus incurs neither redundancy nor loss of information. Surprisingly, $d$ can be computed in linear time using the pyramid algorithm [77].
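To make the pyramid algorithm concrete, here is a minimal Python sketch of the orthonormal Haar transform (the function name is ours, not HaMMLET's; it assumes T is a power of 2). Each pass halves the signal, so the total work is T + T/2 + T/4 + … = O(T):

```python
import math

def haar_pyramid(y):
    """Orthonormal Haar wavelet transform via the pyramid algorithm.

    Returns coefficients in breadth-first order: the overall scaling
    coefficient first, then detail coefficients from coarsest to finest.
    Requires len(y) to be a power of 2.
    """
    T = len(y)
    assert T > 0 and T & (T - 1) == 0, "T must be a power of 2"
    s = list(y)                  # scaling (local sum) coefficients
    details = []                 # collected finest level first
    while len(s) > 1:
        details.append([(s[2*i] - s[2*i+1]) / math.sqrt(2)
                        for i in range(len(s) // 2)])
        s = [(s[2*i] + s[2*i+1]) / math.sqrt(2)
             for i in range(len(s) // 2)]
    d = [s[0]]
    for lvl in reversed(details):    # emit coarsest to finest
        d.extend(lvl)
    return d
```

Because the transform is orthogonal, the coefficient vector has the same Euclidean norm as the data (Parseval), and a signal consisting of two constant blocks produces exactly one non-zero detail coefficient.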

Compression via wavelet shrinkage

The Haar wavelet transform has an important property connecting it to block creation. Let $\tilde d$ be a vector obtained by setting elements of $d$ to zero; then $\tilde y = \mathcal{W}^\intercal \tilde d$ is called selective wavelet reconstruction (SW). If all coefficients $d_{jk}$ with $\psi_{jk}(x_t) \ne \psi_{jk}(x_{t+1})$ are set to zero for some $t$, then $\tilde y_t = \tilde y_{t+1}$, which implies a block structure on $\tilde y$. Conversely, blocks of size > 2 (to account for some pathological cases) can only be created using SW. This general equivalence between SW and compression is central to our method. Note that $\tilde y$ does not have to be computed explicitly; the block boundaries can be inferred from the position of zero-entries in $\tilde d$ alone.

Assume all HMM states had the same emission variance $\sigma^2$. Since each state is associated with an emission mean, finding $q$ can be viewed as a regression or smoothing problem of finding an estimate $\hat\mu$ of a piecewise constant function $\mu$ whose range is precisely the set of emission means, i.e.

$$\mu = f(x), \qquad y = f(x) + \varepsilon, \qquad \varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2).$$

Unfortunately, regression methods typically do not limit the number of distinct values recovered, and will instead return some estimate $\tilde y \ne \mu$. However, if $\tilde y$ is piecewise constant and minimizes $\lVert \mu - \tilde y \rVert$, the sample means of each block are close to the true emission means. This yields high likelihood for their corresponding state, and hence a strong posterior distribution, leading to fast convergence. Furthermore, the change points in $\mu$ must be close to change points in $\tilde y$, since moving block boundaries incurs additional loss, allowing for approximate recovery of true breakpoints. $\tilde y$ might, however, induce additional block boundaries that reduce the compression ratio.

In a series of ground-breaking papers, Donoho, Johnstone et al. [79–83] showed that SW could in theory be used as an almost ideal, spatially adaptive regression method. Assuming one could provide an oracle $\Delta(\mu, y)$ that would know the true $\mu$, then there exists a method $M_{\text{SW}}(y, \Delta)$, using an optimal subset of wavelet coefficients provided by $\Delta$, such that the quadratic risk of $\tilde y_{\text{SW}} = \mathcal{W}^\intercal \tilde d_{\text{SW}}$ is bounded as

$$\lVert \mu - \tilde y_{\text{SW}} \rVert_2^2 = O\!\left( \frac{\sigma^2 \ln T}{T} \right).$$

By definition, $M_{\text{SW}}$ would be the best compression method under the constraints of the Haar wavelet basis. This bound is generally unattainable, since the oracle cannot be queried. Instead, they have shown that for a method $M_{\text{WCT}}(y, \lambda\sigma)$, called wavelet coefficient thresholding, which sets those coefficients to zero whose absolute value is smaller than some threshold $\lambda\sigma$, there exists some $\lambda_T \le \sqrt{2 \ln T}$, with $\tilde y_{\text{WCT}} = \mathcal{W}^\intercal \tilde d_{\text{WCT}}$, such that

$$\lVert \mu - \tilde y_{\text{WCT}} \rVert_2^2 \le (2 \ln T + 1) \left( \lVert \mu - \tilde y_{\text{SW}} \rVert_2^2 + \frac{\sigma^2}{T} \right).$$

This $\lambda_T$ is minimax, i.e. the maximum risk incurred over all possible data sets is not larger than that of any other threshold, and no better bound can be obtained. It is not easily computed, but for large $T$, on the order of tens to hundreds, the universal threshold $\lambda_T^u = \sqrt{2 \ln T}$ is asymptotically minimax. In other words, for data large enough to warrant compression, universal thresholding is the best method to approximate $\mu$, and thus the best wavelet-based compression scheme for a given noise level $\sigma^2$.
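As an illustration of this compression step, the following Python sketch (function and name are ours, not the HaMMLET code) applies the universal threshold to the Haar detail coefficients and reads off the induced block boundaries directly, without reconstructing $\tilde y$: a surviving wavelet $\psi_{jk}$ can change the reconstruction only at the start, midpoint, and end of its support, so those positions are the only candidate breakpoints.

```python
import math

def universal_threshold_blocks(y, sigma):
    """Block boundaries induced by universal wavelet coefficient
    thresholding; coefficients with |d| <= sigma * sqrt(2 ln T) are
    zeroed. len(y) must be a power of 2. Returns sorted boundary
    positions, always including 0 and T.
    """
    T = len(y)
    lam = sigma * math.sqrt(2.0 * math.log(T))   # universal threshold
    bounds = {0, T}
    s = list(y)
    while len(s) > 1:
        span = T // len(s)       # original points per scaling coefficient
        for k in range(len(s) // 2):
            d = (s[2*k] - s[2*k+1]) / math.sqrt(2)
            if abs(d) > lam:
                start = 2 * k * span
                bounds.update({start, start + span, start + 2 * span})
        s = [(s[2*i] + s[2*i+1]) / math.sqrt(2) for i in range(len(s) // 2)]
    return sorted(bounds)
```

On a clean step signal, only the coarsest coefficient survives, recovering exactly the two true blocks; on constant data, nothing survives and a single block remains.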

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multiple emission variances. Since we use a thresholding method, decreasing the variance simply subdivides existing blocks. If the threshold is set to the smallest emission variance among all states, $\tilde y$ will approximately preserve the breakpoints around those low-variance segments. Those of high variance are split into blocks of lower sample variance; see [84, 85] for an analytic expression. While the variances for the different states are not known a priori, FBG provides samples in each iteration. We hence propose the following simple adaptation: in each sampling iteration, use the smallest sampled variance parameter to create a new block sequence via wavelet thresholding (Algorithm 1).

Algorithm 1: Dynamically adaptive FBG for HMMs

 1: procedure HaMMLET(y, π_A, π_θ, π_q)
 2:     T ← |y|
 3:     λ ← √(2 ln T)
 4:     A ∼ P(A | π_A)
 5:     θ ∼ P(θ | π_θ)
 6:     for i = 1, …, N do
 7:         σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:         Create block sequence B from threshold λσ_min
 9:         q ∼ P(A, B, θ | q) P(q | π_q)
10:         A ∼ P(q | A) P(A | π_A)
11:         θ ∼ P(q, B | θ) P(θ | π_θ)
12:         Add count of marginal states for q to result
13:     end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large. In this case, blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

$$\sigma^2_{\text{MAD}} = \left( \frac{\operatorname{MAD}_k d_{\log_2 T - 1,\, k}}{\Phi^{-1}\!\left(\frac{3}{4}\right)} \right)^2,$$

as an estimate of the variance in the dominant component, and modify the threshold definition to $\lambda \cdot \min\{\sigma_{\text{MAD}}, \sigma_i \mid \sigma_i^2 \in \theta\}$. If the data is not i.i.d., $\sigma^2_{\text{MAD}}$ will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
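The MAD estimator above can be sketched in a few lines of Python (our naming; $\Phi^{-1}(3/4) \approx 0.6745$). At the finest scale, $d_{\log_2 T - 1, k} = (y_{2k} - y_{2k+1})/\sqrt{2}$ is $N(0, \sigma^2)$ wherever no breakpoint separates the pair, so the median absolute deviation is robust against the sparse large coefficients caused by true segment boundaries:

```python
import math
import statistics

def sigma_mad(y):
    """Robust noise-SD estimate from finest-level Haar detail
    coefficients, rescaled by 1/Phi^{-1}(3/4) to be consistent
    for Gaussian noise."""
    d = [(y[2*i] - y[2*i+1]) / math.sqrt(2) for i in range(len(y) // 2)]
    med = statistics.median(d)
    mad = statistics.median([abs(c - med) for c in d])
    return mad / 0.6745            # Phi^{-1}(3/4) ~= 0.6745
```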

A data structure for dynamic compression

The necessity to recreate a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields $d$ sorted according to $(j, k)$. Again, let $h = \log_2 T$ and $\ell = h - j$. We can map the wavelets $\psi_{jk}$ to a perfect binary tree of height $h$, such that all wavelets for scale $j$ are nodes on level $\ell$, nodes within each level are sorted according to $k$, and $\ell$ is increasing from the leaves to the root (Fig. 7). $d$ represents a breadth-first search (BFS) traversal of that tree, with $d_{jk}$ being the entry at position $2^j + k$. Adding $y_i$ as the $i$-th leaf on level $\ell = 0$, each non-leaf node represents a wavelet which is non-zero for the $n = 2^\ell$ data points $y_t$, for $t$ in the interval $I_{jk} = [kn, (k+1)n - 1]$, stored in the leaves below; notice that for the leaves, $kn = t$.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node $N_{\ell,t}$ corresponds to wavelet $\psi_{jk}$, with $\ell = h - j$ and $t = k 2^\ell$ ($\psi_{-1,0}$ is simply constant on the $[0,1)$ interval and has no effect on block creation, thus we discard it). Essentially, $\ell$ numbers the levels beginning at the leaves, and $t$ marks the start position of the block when pruning the subtree rooted at $N_{\ell,t}$. The members stored in each node are:

• The number of leaves, corresponding to the block size: $N_{\ell,t}[n] = 2^\ell$.

• The sum of data points stored in the subtree leaves: $N_{\ell,t}[\Sigma_1] = \sum_{i \in I_{jk}} y_i$.

• Similarly, the sum of squares: $N_{\ell,t}[\Sigma_2] = \sum_{i \in I_{jk}} y_i^2$.

• The maximum absolute wavelet coefficient of the subtree, including the current $d_{jk}$ itself: $N_{0,t}[d] = 0$, $N_{\ell>0,t}[d] = \max_{\ell' \le \ell,\ t \le t' < t + 2^\ell} \left| d_{h-\ell',\, t'/2^{\ell'}} \right|$.

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
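The flat-array tree and the pruning traversal can be sketched as follows (Python; the names and the recursive construction are our illustration, and the noise-control option is omitted). The key invariants are that a subtree with n leaves occupies 2n − 1 consecutive DFS positions, and that the Haar coefficient of a node covering n points equals (sum_left − sum_right)/√n:

```python
import math

def build_wavelet_tree(y):
    """DFS-ordered flat array of nodes over y (len(y) a power of 2).
    Each node stores block size n, start position t, sufficient
    statistics s1/s2, and the maximum absolute wavelet coefficient d
    of its subtree (leaves store d = 0)."""
    nodes = []
    def rec(lo, hi):
        idx = len(nodes)
        nodes.append(None)                       # reserve DFS slot
        if hi - lo == 1:                         # leaf
            nodes[idx] = dict(n=1, t=lo, s1=y[lo], s2=y[lo] ** 2, d=0.0)
            return nodes[idx]
        mid = (lo + hi) // 2
        left, right = rec(lo, mid), rec(mid, hi)
        n = hi - lo
        coeff = (left["s1"] - right["s1"]) / math.sqrt(n)
        nodes[idx] = dict(n=n, t=lo,
                          s1=left["s1"] + right["s1"],
                          s2=left["s2"] + right["s2"],
                          d=max(abs(coeff), left["d"], right["d"]))
        return nodes[idx]
    rec(0, len(y))
    return nodes

def create_blocks(nodes, threshold):
    """Linear scan with pruning jumps i -> i + 2n - 1: emit a block as
    soon as a subtree's max |coefficient| falls below the threshold."""
    out, i = [], 0
    while i < len(nodes):
        nd = nodes[i]
        if nd["d"] <= threshold:                 # prune: one block
            out.append((nd["t"], nd["n"], nd["s1"], nd["s2"]))
            i += 2 * nd["n"] - 1
        else:                                    # descend into children
            i += 1
    return out
```

Each emitted tuple carries the sufficient statistics needed for the block likelihood, so the sampler never revisits individual observations.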

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.

The wavelet tree data structure can readily be extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees was above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that the block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

Figure 7: Mapping of wavelets ψ_{jk} and data points y_t to tree nodes N_{ℓ,t} (axes: starting position of block t, block size n, level ℓ). Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

Figure 8: Example of dynamic block creation (axes: position along chromosome, measurement, block size). The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines on the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse-Gamma distribution, $(\mu, \sigma^2) \sim \mathrm{NI}\Gamma(\mu_0, \nu, \alpha, \beta)$, whose marginals simply separate into a Normal and an Inverse Gamma distribution:

$$\sigma^2 \sim \mathrm{I}\Gamma(\alpha, \beta), \qquad \mu \sim N\!\left(\mu_0, \frac{\sigma^2}{\nu}\right).$$

Let $s^2$ be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or $\sigma^2_{\text{MAD}}$), and $p$ the desired probability to sample a variance not larger than $s^2$. From the CDF of $\mathrm{I}\Gamma$, we obtain

$$p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right).$$

$\mathrm{I}\Gamma$ has a mean for $\alpha > 1$, and closed-form solutions for $\alpha \in \mathbb{N}$. Furthermore, $\mathrm{I}\Gamma$ has positive skewness for $\alpha > 3$. We thus let $\alpha = 2$, which yields

$$\beta = -s^2 \left( W_{-1}\!\left(-\frac{p}{e}\right) + 1 \right), \qquad 0 < p \le 1,$$

where $W_{-1}$ is the negative branch of the Lambert $W$-function, which is transcendental. However, an excellent analytical approximation, with a maximum error of 0.025%, is given in [87], which yields

$$\beta \approx s^2 \left( \frac{2\sqrt{b}}{M_1 \sqrt{b} + \sqrt{2}\left( M_2\, b\, e^{M_3 \sqrt{b}} + 1 \right)} + b \right), \qquad b = -\ln p,$$

$$M_1 = 0.3361, \qquad M_2 = -0.0042, \qquad M_3 = -0.0201.$$

Since the mean of $\mathrm{I}\Gamma$ is $\frac{\beta}{\alpha - 1}$, the expected variance of $\mu$ is $\frac{\beta}{\nu}$ for $\alpha = 2$. To ensure proper mixing, we could simply set $\frac{\beta}{\nu}$ to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that $\mu$ contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from $\mu_0$ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define $\delta$ to be the sample variance of block means in the compression obtained by $\sigma^2_{\text{MAD}}$, and take the maximum of those two variances. We thus obtain

$$\mu_0 = \frac{\Sigma_1}{n} \qquad \text{and} \qquad \nu = \beta \cdot \max\!\left\{ \frac{n\Sigma_2 - \Sigma_1^2}{n^2},\ \delta \right\}^{-1}.$$
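As a numerical check of the closed form above, here is a Python sketch (our naming) that computes β for given s² and p via the Barry et al. approximation of W₋₁. For α = 2, the identity P(σ² ≤ s²) = Q(2, β/s²) = (1 + β/s²)e^{−β/s²} lets us verify the result directly:

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201   # constants from Barry et al. [87]

def beta_hyperparameter(s2, p):
    """beta such that P(sigma^2 <= s2) ~= p under IG(alpha=2, beta),
    i.e. beta = -s2 * (W_{-1}(-p/e) + 1), using the analytical
    approximation of the negative Lambert-W branch with b = -ln p."""
    b = -math.log(p)
    w_corr = (2.0 * math.sqrt(b)
              / (M1 * math.sqrt(b)
                 + math.sqrt(2.0) * (M2 * b * math.exp(M3 * math.sqrt(b)) + 1.0)))
    return s2 * (w_corr + b)
```

For p = 0.9 and s² = 1, this gives β ≈ 0.53, and indeed (1 + β)e^{−β} ≈ 0.90, well within the stated approximation error.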

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

$$\frac{A_{jj}^{\,n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right).$$

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much stabler logarithm function:

$$\exp\!\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1) \log A_{jj} - n_w \log \sigma_j \right).$$

Using the block's sufficient statistics

$$n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2,$$

the exponent can be rewritten as

$$E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j), \qquad K(n_w, j) = (n_w - 1) \log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right).$$

Note that $K(n_w, j)$ can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence $E_w(j) \le 0$. This is equivalent to dividing the forward variables by

$$\exp\!\left( \max_k E_w(k) \right),$$

which cancels out during normalization. Hence we obtain

$$\alpha_w(j) = \exp\!\left( E_w(j) - \max_k E_w(k) \right) \sum_i \hat\alpha_{w-1}(i)\, A_{ij},$$

which are then normalized to

$$\hat\alpha_w(j) = \frac{\alpha_w(j)}{\sum_k \alpha_w(k)}.$$
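The scheme can be condensed into a short Python sketch (our naming; a generic two-state usage is shown in the check): one exp per state per block, with K precomputable once per iteration. Up to the constant (2π)^{−n_w/2}, which cancels during normalization, it reproduces the naive per-observation computation:

```python
import math

def forward_step(alpha_prev, A, mus, sigmas, n_w, S1, S2):
    """One forward step on a compressed block with sufficient statistics
    (n_w, S1, S2): E_w(j) = (2*mu_j*S1 - S2)/(2*sigma_j^2) + K(n_w, j),
    with max-subtraction guarding exp() against over-/underflow."""
    M = len(mus)
    E = []
    for j in range(M):
        K = ((n_w - 1) * math.log(A[j][j])
             - n_w * (math.log(sigmas[j]) + mus[j] ** 2 / (2.0 * sigmas[j] ** 2)))
        E.append((2.0 * mus[j] * S1 - S2) / (2.0 * sigmas[j] ** 2) + K)
    m = max(E)                                   # largest exponent, so E - m <= 0
    alpha = [math.exp(E[j] - m)
             * sum(alpha_prev[i] * A[i][j] for i in range(M))
             for j in range(M)]
    z = sum(alpha)
    return [a / z for a in alpha]
```

Comparing against the direct product of Gaussian emission densities confirms that all dropped constants cancel in the normalization.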

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


bioRxiv preprint, this version posted July 31, 2015. doi: https://doi.org/10.1101/023705. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

[20] Cleveland WS: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA copy number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A new smoothing model for analyzing array CGH data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9 Suppl 2, S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137



[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains (1966). doi:10.1214/aoms/1177699147. http://www.jstor.org/stable/2238772 Accessed 2015-02-14

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030



[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), 6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(R). Transactions of the American Mathematical Society 315(1), 69 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425



[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5




to warrant compression, universal thresholding is the best method to approximate µ, and thus the best wavelet-based compression scheme for a given noise level σ².

Integrating wavelet shrinkage into FBG

This compression method can easily be extended to multipleemission variances Since we use a thresholding methoddecreasing the variance simply subdivides existing blocks Ifthe threshold is set to the smallest emission variance amongall states y will approximately preserve the breakpointsaround those low-variance segments Those of high vari-ance are split into blocks of lower sample variance see[84 85] for an analytic expression While the variances forthe different states are not known FBG provides a priorisamples in each iteration We hence propose the follow-ing simple adaptation In each sampling iteration use thesmallest sampled variance parameter to create a new blocksequence via wavelet thresholding (Algorithm 1)

Algorithm 1 Dynamically adaptive FBG for HMMs

 1: procedure HaMMLET(y, π_A, π_θ, π_q)
 2:     T ← |y|
 3:     λ ← √(2 ln T)
 4:     A ∼ P(A | π_A)
 5:     θ ∼ P(θ | π_θ)
 6:     for i = 1, …, N do
 7:         σ_min ← min{σ_MAD, σ_i | σ_i² ∈ θ}
 8:         Create block sequence B from threshold λσ_min
 9:         q ∼ P(A, B, θ | q) P(q | π_q)
10:         A ∼ P(q | A) P(A | π_A)
11:         θ ∼ P(q, B | θ) P(θ | π_θ)
12:         Add count of marginal states for q to result
13:     end for
14: end procedure

While intuitively easy to understand, provable guarantees for the optimality of this method, specifically the correspondence between the wavelet and the HMM domain, remain an open research topic. A potential problem could arise if all sampled variances are too large: in this case, blocks would be under-segmented, yield wrong posterior variances, and hide possible state transitions. As a safeguard against over-compression, we use the standard method to estimate the variance of constant noise in shrinkage applications,

    σ²_MAD = ( MAD_k( d_{log₂T − 1, k} ) / Φ⁻¹(3/4) )²

as an estimate of the variance in the dominant component, and modify the threshold definition to λ · min{σ_MAD, σ_i | σ_i² ∈ θ}. If the data is not i.i.d., σ²_MAD will systematically underestimate the true variance [24]. In this case, the blocks get smaller than necessary, thus decreasing the compression.
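For illustration, this MAD-based estimator can be computed directly from the finest-level Haar detail coefficients; the following is a minimal sketch (function name and NumPy usage are ours, not from the paper), assuming the number of observations is even:

```python
import numpy as np

def sigma2_mad(y):
    """Estimate sigma^2_MAD from the finest-level Haar detail
    coefficients d_{log2(T)-1,k} = (y_{2k} - y_{2k+1}) / sqrt(2)."""
    y = np.asarray(y, dtype=float)
    d = (y[0::2] - y[1::2]) / np.sqrt(2.0)
    # Phi^{-1}(3/4) ~ 0.6745 rescales the MAD to a standard
    # deviation under Gaussian noise.
    sigma = np.median(np.abs(d)) / 0.6745
    return sigma * sigma
```

On i.i.d. Gaussian data the estimate converges to the true noise variance, while large but sparse detail coefficients caused by breakpoints barely affect the median.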

A data structure for dynamic compression

The necessity to recreate a new block sequence in each iteration, based on the most recent estimate of the smallest variance parameter, creates the challenge of doing so with little computational overhead, specifically without repeatedly computing the inverse wavelet transform or considering all T elements in other ways. We achieve this by creating a simple tree-based data structure.

The pyramid algorithm yields d sorted according to (j, k). Again, let h = log₂ T and ℓ = h − j. We can map the wavelet ψ_{j,k} to a perfect binary tree of height h, such that all wavelets for scale j are nodes on level ℓ, nodes within each level are sorted according to k, and ℓ is increasing from the leaves to the root (Fig. 7). d represents a breadth-first search (BFS) traversal of that tree, with d_{j,k} being the entry at position ⌊2^j⌋ + k. Adding y_i as the i-th leaf on level ℓ = 0, each non-leaf node represents a wavelet which is non-zero for the n = 2^ℓ data points y_t, for t in the interval I_{j,k} = [kn, (k + 1)n − 1], stored in the leaves below; notice that kn = t for the leaves.

This implies that the leaves in any subtree all have the same value after wavelet thresholding if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet coefficient in the entire subtree, as well as the sufficient statistics required for calculating the likelihood function. More formally, a node N_{ℓ,t} corresponds to wavelet ψ_{j,k}, with ℓ = h − j and t = k·2^ℓ (ψ_{−1,0} is simply constant on the [0, 1) interval and has no effect on block creation, thus we discard it). Essentially, ℓ numbers the levels beginning at the leaves, and t marks the start position of the block when pruning the subtree rooted at N_{ℓ,t}. The members stored in each node are:

• The number of leaves, corresponding to the block size:

    N_{ℓ,t}[n] = 2^ℓ

• The sum of data points stored in the subtree leaves:

    N_{ℓ,t}[Σ₁] = Σ_{i ∈ I_{j,k}} y_i

• Similarly, the sum of squares:

    N_{ℓ,t}[Σ₂] = Σ_{i ∈ I_{j,k}} y_i²

• The maximum absolute wavelet coefficient of the subtree, including the current d_{j,k} itself:

    N_{0,t}[d] = 0,   N_{ℓ>0,t}[d] = max_{ℓ′ ≤ ℓ, t ≤ t′ < t+2^ℓ} | d_{h−ℓ′, t′/2^{ℓ′}} |

All these values can be computed recursively from the child nodes in linear time. As some real data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.
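The recursive computation of these node statistics can be sketched as follows; this is an illustrative Python reimplementation (not the authors' C++ code), storing tuples (n, Σ₁, Σ₂, max |d|) in DFS order and assuming T is a power of two:

```python
import math

def build_wavelet_tree(y):
    """Build the flat wavelet-tree array in DFS order. Each node
    stores (n, Sigma1, Sigma2, maxcoef), where maxcoef is the
    largest absolute Haar detail coefficient in the subtree
    (0 at the leaves, per N_{0,t}[d] = 0)."""
    out = []

    def rec(lo, hi):
        n = hi - lo
        idx = len(out)
        out.append(None)          # reserve the DFS slot for this node
        if n == 1:
            out[idx] = (1, y[lo], y[lo] * y[lo], 0.0)
            return out[idx]
        mid = (lo + hi) // 2
        _, s1l, s2l, ml = rec(lo, mid)
        _, s1r, s2r, mr = rec(mid, hi)
        d = abs(s1l - s1r) / math.sqrt(n)   # Haar coefficient d_{j,k}
        out[idx] = (n, s1l + s1r, s2l + s2r, max(d, ml, mr))
        return out[idx]

    rec(0, len(y))
    return out
```

A tree over T leaves occupies 2T − 1 slots, and the root (the first entry) holds the sufficient statistics of the entire data sequence.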

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array according to (kn, j). Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created, and the entire subtree is skipped (dashed lines). As the tree is a perfect, complete binary tree, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
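The pruning traversal then reduces to a single linear scan over the flat array; a sketch (names ours), where `tree[i]` is a (n, Σ₁, Σ₂, maxcoef) tuple in DFS order:

```python
def create_blocks(tree, threshold):
    """Create the block sequence from a wavelet tree flattened in
    DFS order. Pruning a subtree with n leaves skips its 2n - 1
    nodes, so the next DFS position is i + 2n - 1; without pruning
    it is i + 1. Returns a list of (start, n) blocks."""
    blocks = []
    i, start = 0, 0
    while i < len(tree):
        n, _, _, maxcoef = tree[i]
        if maxcoef <= threshold:
            blocks.append((start, n))   # emit one block of n leaves
            start += n
            i += 2 * n - 1              # skip the pruned subtree
        else:
            i += 1                      # descend in DFS order
    return blocks
```

Since leaves carry a coefficient of 0, the scan always terminates with blocks covering all T positions, touching at most 2T − 1 array entries.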

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other, which occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence allows us to avoid recomputation.
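Block structure prediction amounts to a small bookkeeping structure; a sketch (class and method names hypothetical, not from the paper):

```python
class BlockStructureCache:
    """Per block count, record the range of variances known to
    produce that block sequence. A new variance inside the range
    recorded for the previous iteration's count cannot change the
    block sequence, so recomputation can be skipped."""

    def __init__(self):
        self.ranges = {}        # block count -> (min_var, max_var)
        self.last_count = None

    def can_skip(self, var):
        r = self.ranges.get(self.last_count)
        return r is not None and r[0] <= var <= r[1]

    def record(self, var, count):
        lo, hi = self.ranges.get(count, (var, var))
        self.ranges[count] = (min(lo, var), max(hi, var))
        self.last_count = count
```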

The wavelet tree data structure can readily be extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient of any of the m trees is above the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that block creation can be done in O(T) instead of O(mT); i.e., the dimensionality of the data only enters into the creation of the data structure, but not into the query during sampling iterations.

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the

[Figure 7; x-axis: starting position of block (t), y-axes: block size (n) and level (ℓ); legend: DFS array (i+1), wavelet tree, pruning jumps (i+2n−1).]

Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n, starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus, the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.

[Figure 8; x-axis: position along chromosome, y-axes: block size (1–256) and measurement.]

Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines on the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; thus, the compression ratio is 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.



following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse-Gamma distribution, (µ, σ²) ∼ NIΓ(µ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse-Gamma distribution:

    σ² ∼ IΓ(α, β),   µ ∼ N(µ₀, σ²/ν)

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and p the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain

    p = P(σ² ≤ s²) = Γ(α, β/s²) / Γ(α) = Q(α, β/s²)

IΓ has a mean for α > 1 and closed-form solutions for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

    β = −s² ( W₋₁(−p/e) + 1 ),   0 < p ≤ 1

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation with a maximum error of 0.025% is given in [87], which yields

    β ≈ s² ( √(2b) / ( M₁ √(b/2) + M₂ b exp(M₃ √b) + 1 ) + b )

    b = −ln p,   M₁ = 0.3361,   M₂ = −0.0042,   M₃ = −0.0201
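For concreteness, the closed-form path from p and s² to β can be written out as follows (a sketch; for α = 2, the exact CDF condition Q(2, β/s²) = p reduces to (1 + β/s²) e^{−β/s²} = p, which we use to check the approximation):

```python
import math

def beta_from(s2, p):
    """Hyperparameter beta such that P(sigma^2 <= s2) = p under
    IG(alpha=2, beta), via the Lambert-W approximation of [87]."""
    M1, M2, M3 = 0.3361, -0.0042, -0.0201
    b = -math.log(p)
    w = math.sqrt(2.0 * b) / (M1 * math.sqrt(b / 2.0)
                              + M2 * b * math.exp(M3 * math.sqrt(b))
                              + 1.0)
    return s2 * (w + b)
```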

Since the mean of IΓ is β/(α − 1), the expected variance of µ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that µ contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from µ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

    µ₀ = Σ₁/n   and   ν = β · max( (nΣ₂ − Σ₁²)/n², δ )⁻¹
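Putting the pieces together, the hyperparameters follow directly from the root node's sufficient statistics; a sketch with hypothetical argument names (n, Σ₁, Σ₂ from the root, δ the block-mean variance, β as above):

```python
def auto_prior(n, sum1, sum2, delta, beta):
    """mu0 and nu from sufficient statistics: mu0 is the data mean;
    nu scales beta by the larger of the sample variance of the data
    and the block-mean variance delta."""
    mu0 = sum1 / n
    sample_var = (n * sum2 - sum1 * sum1) / (n * n)
    nu = beta / max(sample_var, delta)
    return mu0, nu
```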

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

    A_{jj}^{n_w − 1} / ( (2π)^{n_w/2} σ_j^{n_w} ) · exp( −Σ_{k=1}^{n_w} (y_{w,k} − µ_j)² / (2σ_j²) )

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

    exp( −Σ_{k=1}^{n_w} (y_{w,k} − µ_j)² / (2σ_j²) + (n_w − 1) log A_{jj} − n_w log σ_j )

Using the block's sufficient statistics

    n_w,   Σ₁ = Σ_{k=1}^{n_w} y_{w,k},   Σ₂ = Σ_{k=1}^{n_w} y²_{w,k}

the exponent can be rewritten as

    E_w(j) = (2µ_j Σ₁ − Σ₂) / (2σ_j²) + K(n_w, j)

    K(n_w, j) = (n_w − 1) log A_{jj} − n_w ( log σ_j + µ_j² / (2σ_j²) )

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by exp(max_k E_w(k)), which cancels out during normalization. Hence we obtain

    α̂_w(j) = exp( E_w(j) − max_k E_w(k) ) · Σ_i α_{w−1}(i) A_{ij}

which are then normalized to

    α_w(j) = α̂_w(j) / Σ_k α̂_w(k)
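One forward step on a compressed block can thus be implemented with a single exp call per state; an illustrative NumPy sketch (variable names ours, using log σ_j = ½ log σ_j²):

```python
import numpy as np

def forward_block(alpha_prev, A, mu, sigma2, n_w, S1, S2):
    """Forward variables for one block of n_w observations with
    sufficient statistics S1, S2. K is the precomputable part of
    the exponent; the maximum exponent is subtracted before
    exponentiation and the result is normalized."""
    log_a_jj = np.log(np.diag(A))
    K = (n_w - 1) * log_a_jj - n_w * (0.5 * np.log(sigma2)
                                      + mu * mu / (2.0 * sigma2))
    E = (2.0 * mu * S1 - S2) / (2.0 * sigma2) + K
    alpha = np.exp(E - E.max()) * (alpha_prev @ A)
    return alpha / alpha.sum()
```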

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.



Authorsrsquo contributions

JW and AS conceived and supervised the study and wrotethe paper EB and JW developed the implementation Allauthors discussed read and approved the article

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful adviceE Brugel participated in the DIMACS REU program withsupport from NSF award 1263082

References

[1] Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W., Lee, C.: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk, L., Marshall, C.R., Wintle, R.F., Scherer, S.W.: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll, S.A., Altshuler, D.M.: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain, L.V., Armour, J.A.L., Tobin, M.D.: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann, J.S., Sharp, A.J., Antonarakis, S.E.: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook, E.H., Scherer, S.W.: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic, A., Tomljanovic, D.: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas, A.K., Corvin, A.P., Gallagher, L.: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho, E.K., Tchinda, J., Freeman, J.L., Chung, Y.-J., Cai, W.W., Lee, C.: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien, A., Malkin, D.: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk, L., Carson, A.R., Scherer, S.W.: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann, J.S., Estivill, X., Antonarakis, S.E.: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp, A.J.: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis, C., Coe, B.P., Zhang, L., Rosin, M.P., Lam, W.L.: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson, D.G., Ylstra, B., Segraves, R., Collins, C., Dairkee, S.H., Kowbel, D., Kuo, W.L., Gray, J.W., Pinkel, D.: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman, J.A., Fridlyand, J., Pejavar, S., Olshen, A.B., Korkola, J.E., DeVries, S., Carroll, P., Kuo, W.-L., Pinkel, D., Albertson, D.G., Cordon-Cardo, C., Jain, A.N., Waldman, F.M.: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio, R., Hautaniemi, S., Kauraniemi, P., Yli-Harja, O., Astola, J., Wolf, M., Kallioniemi, A.: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack, J.R., Sørlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lonning, P.E., Tibshirani, R., Botstein, D., Børresen-Dale, A.-L., Brown, P.O.: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers, P.H.C., de Menezes, R.X.: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


[20] Cleveland, W.S.: Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti, B., Braude, I., Marrano, P., Thorner, P., Zielenska, M., Squire, J.A.: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl, J., Spokoiny, V.G.: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé, P., Stransky, N., Thiery, J.-P., Radvanyi, F., Barillot, E.: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang, Y., Wang, S.: A novel stationary wavelet denoising algorithm for array-based DNA Copy Number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu, L., Self, S.G., Grove, D., Randolph, T., Wang, K., Delrow, J.J., Loo, L., Porter, P.: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: A New Smoothing Model for Analyzing Array CGH Data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen, N., Huang, H., Oraintara, S., Vo, A.: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang, H., Nguyen, N., Oraintara, S., Vo, A.: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt, C., Losic, B., Pai, D., Zhao, Z., Trinh, Q., Syam, S., Arshadi, N., Jang, G.H., Ali, J., Beck, T., McPherson, J., Muthuswamy, L.B.: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price, T.S., Regan, R., Mott, R., Hedman, A., Honey, B., Daniels, R.J., Smith, L., Greenfield, A., Tiganescu, A., Buckle, V., Ventress, N., Ayyub, H., Salhan, A., Pedraza-Diaz, S., Broxholme, J., Ragoussis, J., Higgs, D.R., Flint, J., Knight, S.J.L.: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis, C.E., Peng, R., Tsiarli, M.A., Miller, G.L., Schwartz, R.: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, 1–1 (2011). doi:10.1145/1963190.2063517

[32] Olshen, A.B., Venkatraman, E.S.: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman, E.S., Olshen, A.B.: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard, F., Robin, S., Lavielle, M., Vaisse, C., Daudin, J.-J.: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers, C.L., Dunham, M.J., Kung, S.Y., Troyanskaya, O.G.: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang, P., Kim, Y., Pollack, J., Narasimhan, B., Tibshirani, R.: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing, B., Greenwood, C.M.T., Bull, S.B.: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen, C.-H., Lee, H.-C.C., Ling, Q., Chen, H.-R., Ko, Y.-A., Tsou, T.-S., Wang, S.-C., Wu, L.-C., Lee, H.-C.C.: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong, K., Marchiori, E., van der Vaart, A., Ylstra, B., Weiss, M., Meijer, G.: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum, L.E., Petrie, T.: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics (1966). doi:10.1214/aoms/1177699147

[42] Snijders, A.M., Fridlyand, J., Mans, D.A., Segraves, R., Jain, A.N., Pinkel, D., Albertson, D.G.: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Månér, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T.C., Trask, B., Patterson, N., Zetterberg, A., Wigler, M.: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat, J., Lakshmi, B., Malhotra, D., Troge, J., Lese-Martin, C., Walsh, T., Yamrom, B., Yoon, S., Krasnitz, A., Kendall, J., Leotta, A., Pai, D., Zhang, R., Lee, Y.-H., Hicks, J., Spence, S.J., Lee, A.T., Puura, K., Lehtimäki, T., Ledbetter, D., Gregersen, P.K., Bregman, J., Sutcliffe, J.S., Jobanputra, V., Chung, W., Warburton, D., King, M.-C., Skuse, D., Geschwind, D.H., Gilliam, T.C., Ye, K., Wigler, M.: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand, J., Snijders, A.M., Pinkel, D., Albertson, D.G., Jain, A.N.: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao, X.: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries, B.B.A., Pfundt, R., Leisink, M., Koolen, D.A., Vissers, L.E.L.M., Janssen, I.M., van Reijmersdal, S., Nillesen, W.M., Huys, E.H.L.P.G., de Leeuw, N., Smeets, D., Sistermans, E.A., Feuth, T., van Ravenswaaij-Arts, C.M.A., van Kessel, A.G., Schoenmakers, E.F.P.M., Brunner, H.G., Veltman, J.A.: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya, Y., Sanada, M., Nakazaki, K., Hosoya, N., Wang, L., Hangaishi, A., Kurokawa, M., Chiba, S., Bailey, D.K., Kennedy, G.C., Ogawa, S.: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni, J.C., Thorne, N.P., Tavaré, S.: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel, J.O., Urban, A.E., Grubert, F., Du, J., Royce, T.E., Starr, P., Zhong, G., Emanuel, B.S., Weissman, S.M., Snyder, M., Gerstein, M.B.: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan, P., Godfrey, L.E., Eis, P.S., Richmond, T.A., Selzer, R.R., Brent, M., McLeod, H.L., Ley, T.J., Graubert, T.A.: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda, O.M., Diaz-Uriarte, R.: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes, J.: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner, L.R.: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney, G.D.: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib, S.: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott, S.L.: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha, S., Li, Y., Neuberg, D.: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah, S.P., Xuan, X., DeLeeuw, R.J., Khojasteh, M., Lam, W.L., Ng, R., Murphy, K.P.: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah, S.P., Lam, W.L., Ng, R.T., Murphy, K.P.: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud, M.P., Schliep, A.: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov, E., Eldar, Y.C.: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang, J., Meza-Zepeda, L.A., Kresse, S.H., Myklebost, O.: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai, W.R., Johnson, M.D., Kucherlapati, R., Park, P.J.: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock, H., Fridlyand, J.: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson, G., Hager, J.H., Volik, S., Hariono, S., Wernick, M., Moore, D., Nowak, N., Albertson, D.G., Pinkel, D., Collins, C., Hanahan, D., Gray, J.W.: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren, H., Murumagi, A., Kangaspeska, S., Nicorici, D., Hongisto, V., Kleivi, K., Rye, I.H., Nyberg, S., Wolf, M., Børresen-Dale, A.-L., Kallioniemi, O.: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall, S., Hanby, A., Lansdown, M., Speirs, V.: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday, D.L., Speirs, V.: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür, A., Özgür, L., Güngör, T.: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders, A.M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., Hamilton, G., Hindle, A.K., Huey, B., Kimura, K., Law, S., Myambo, K., Palmer, J., Ylstra, B., Yue, J.P., Gray, J.W., Jain, A.N., Pinkel, D., Albertson, D.G.: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud, M.P., Schliep, A.: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies, I.: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat, S.G.: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar, A.: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat, S.G.: Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho, D.L., Johnstone, I.M.: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho, D.L., Johnstone, I.M.: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho, D.L., Johnstone, I.M.: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., Picard, D.: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival, D.B., Mofjeld, H.O.: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh, A., Walden, A.T., Percival, D.B.: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., Shi, T., Tong, W., Shi, L., Hong, H., Zhao, C., Elloumi, F., Shi, W., Thomas, R., Lin, S., Tillinghast, G., Liu, G., Zhou, Y., Herman, D., Li, Y., Deng, Y., Fang, H., Bushel, P., Woods, M., Zhang, J.: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry, D.A., Parlange, J.-Y., Li, L., Prommer, H., Cunningham, C.J., Stagnitti, F.: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5



on the lowest level, it is possible to ignore the first level in the maximum computation, so that no information to create a single-element block for outliers is passed up the tree. We refer to this technique as noise control. Notice that this does not imply that blocks are only created at even t, since true transitions manifest in coefficients on multiple levels.

The block creation algorithm is simple: upon construction, the tree is converted to depth-first search (DFS) order, which simply amounts to sorting the BFS array by the nodes' positions and levels. Given a threshold, the tree is then traversed in DFS order by iterating linearly over the array (Fig. 7, solid lines). Once the maximum coefficient stored in a node is less than or equal to the threshold, a block of size n is created and the entire subtree is skipped (dashed lines). As the tree is perfect binary and complete, the next array position in DFS traversal after pruning the subtree rooted at the node at index i is simply obtained as i + 2n − 1, so no expensive pointer structure needs to be maintained, leaving the tree data structure a simple flat array. An example of dynamic block creation is given in Fig. 8.
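The pruning traversal can be sketched in a few lines (illustrative Python, assuming the per-node subtree maxima and leaf counts have already been extracted from the tree into flat arrays in DFS order; function and variable names are ours):

```python
def create_blocks(dfs_max, dfs_size, threshold):
    """Create a block structure by pruning a perfect binary tree stored
    as a flat array in DFS order.

    dfs_max[i]  -- largest wavelet coefficient in the subtree rooted at node i
    dfs_size[i] -- number of leaves n under node i (a power of 2)
    Returns the list of block sizes, whose sum equals the data length T.
    """
    blocks = []
    i = 0
    while i < len(dfs_max):
        n = dfs_size[i]
        if n == 1 or dfs_max[i] <= threshold:
            blocks.append(n)   # prune: the whole subtree becomes one block
            i += 2 * n - 1     # skip the 2n - 1 nodes of the pruned subtree
        else:
            i += 1             # coefficient too large: descend to the children
    return blocks
```

A subtree whose maximum coefficient does not exceed the threshold is emitted as a single block and its 2n − 1 nodes are skipped in one jump, so a full pass touches each array entry at most once.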

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to change. To avoid recreating the same block structure over and over again, we employ a technique called block structure prediction. Since the different block structures are subdivisions of each other that occur in a specific order for decreasing σ², there is a simple bijection between the number of blocks and the block structure itself. Thus, for each block sequence length, we register the minimum and maximum variance that creates that sequence. Upon entering a new iteration, we check if the current variance would create the same number of blocks as in the previous iteration, which guarantees that we would obtain the same block sequence, and hence can avoid recomputation.
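This bookkeeping amounts to a small lookup structure, sketched here in Python (the class and method names are hypothetical; the interplay with the sampler is simplified):

```python
class BlockStructurePredictor:
    """Register, for each observed number of blocks, the range of variances
    that produced it; a new variance falling into the range recorded for the
    previous iteration's block count cannot change the block structure."""

    def __init__(self):
        self.var_range = {}  # number of blocks -> (min variance, max variance)

    def register(self, num_blocks, variance):
        lo, hi = self.var_range.get(num_blocks, (variance, variance))
        self.var_range[num_blocks] = (min(lo, variance), max(hi, variance))

    def can_reuse(self, prev_num_blocks, variance):
        """True if the block structure of the previous iteration is reusable."""
        if prev_num_blocks not in self.var_range:
            return False
        lo, hi = self.var_range[prev_num_blocks]
        return lo <= variance <= hi
```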

The wavelet tree data structure can be readily extended to multivariate data of dimensionality m. Instead of storing m different trees and reconciling m different block patterns in each iteration, one simply stores m different values for each sufficient statistic in a tree node. Since we have to traverse into the combined tree if the coefficient in any of the m trees exceeds the threshold, we simply store the largest N_{ℓ,t}[d] among the corresponding nodes of the trees, which means that block creation can be done in O(T) instead of O(mT), i.e. the dimensionality of the data only enters into the creation of the data structure, but not the queries during sampling iterations.
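With the per-node coefficient maxima of the m univariate trees held in one array, the combined tree is just an elementwise maximum (a NumPy sketch; array names are illustrative):

```python
import numpy as np

def combine_trees(per_dim_maxima):
    """per_dim_maxima: shape (m, num_nodes), the maximum coefficients stored
    in each of the m univariate wavelet trees, in identical node order.

    Pruning against the elementwise maximum prunes a subtree only if it could
    be pruned in every dimension, so threshold queries cost O(T), not O(mT).
    """
    return np.max(per_dim_maxima, axis=0)
```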

Automatic priors

While Bayesian methods allow for inductive bias, such as the expected location of means, it is desirable to be able to use our method even when little domain knowledge exists or large variation is expected, such as the lab and batch effects commonly observed in micro-arrays [86], as well as unknown means due to sample contamination. Since FBG does require a prior even in that case, we propose the


Figure 7: Mapping of wavelets ψ_{j,k} and data points y_t to tree nodes N_{ℓ,t}. Each node is the root of a subtree with n = 2^ℓ leaves; pruning that subtree yields a block of size n starting at position t. For instance, the node N_{1,6} is located at position 13 of the DFS array (solid line) and corresponds to the wavelet ψ_{3,3}. A block of size n = 2 can be created by pruning the subtree, which amounts to advancing by 2n − 1 = 3 positions (dashed line), yielding N_{3,8} at position 16, which is the wavelet ψ_{1,1}. Thus the number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus strictly smaller than 2T.


Figure 8: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 of which (round markers) yielded a block (vertical lines at the bottom). Sampling is hence done on a short array of 19 blocks instead of 256 individual values; the compression ratio is thus 13.5. The horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in the nodes. Notice how the algorithm creates small blocks around the breakpoints, e.g. at t ≈ 125, which requires traversing to lower levels and thus induces some additional blocks in other parts of the tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, which is unproblematic, as it increases the degrees of freedom in the sampler.


following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, (μ, σ²) ~ NIΓ(μ₀, ν, α, β), whose marginals simply separate into a Normal and an Inverse Gamma distribution:

\[ \sigma^2 \sim \mathrm{I}\Gamma(\alpha, \beta), \qquad \mu \sim \mathcal{N}\!\left(\mu_0, \frac{\sigma^2}{\nu}\right) \]
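Drawing from such a prior is straightforward, e.g. by sampling the variance first and then the conditional mean (a Python sketch; function and parameter names are ours):

```python
import numpy as np

def sample_nig(mu0, nu, alpha, beta, rng):
    """Draw (mu, sigma^2) from a Normal-Inverse Gamma distribution by first
    sampling sigma^2 ~ InvGamma(alpha, beta) and then mu ~ N(mu0, sigma^2/nu)."""
    sigma2 = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta)  # inverse-gamma draw
    mu = rng.normal(mu0, np.sqrt(sigma2 / nu))
    return mu, sigma2
```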

Let s² be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or σ²_MAD), and p the desired probability to sample a variance not larger than s². From the CDF of IΓ, we obtain

\[ p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right) \]

IΓ has a mean for α > 1 and closed-form solutions for α ∈ ℕ. Furthermore, IΓ has positive skewness for α > 3. We thus let α = 2, which yields

\[ \beta = -s^2\left(W_{-1}\!\left(-\frac{p}{e}\right) + 1\right), \qquad 0 < p \le 1, \]

where W₋₁ is the negative branch of the Lambert W-function, which is transcendental. However, an excellent analytical approximation with a maximum error of 0.025% is given in [87], which yields

\[ \beta \approx s^2\left(b + \frac{2}{M_1}\left[1 - \left(1 + \frac{M_1\sqrt{b/2}}{1 + M_2\, b\, \exp\!\left(M_3\sqrt{b}\right)}\right)^{-1}\right]\right), \qquad b = -\ln p, \]

\[ M_1 = 0.3361, \quad M_2 = -0.0042, \quad M_3 = -0.0201 \]
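Computing β from s² and p then takes only a few lines (a sketch; the function name is ours, and the constants are those of the [87] approximation of W₋₁):

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201

def beta_from_quantile(s2, p):
    """Return beta such that an InvGamma(alpha=2, beta) prior puts probability
    p on sigma^2 <= s2, using the analytical approximation of the Lambert
    W-function's lower branch from Barry et al. [87]."""
    b = -math.log(p)
    w = -1.0 - b - (2.0 / M1) * (
        1.0 - 1.0 / (1.0 + M1 * math.sqrt(b / 2.0)
                     / (1.0 + M2 * b * math.exp(M3 * math.sqrt(b)))))
    return -s2 * (1.0 + w)   # beta = -s^2 (W_{-1}(-p/e) + 1)
```

For α = 2 the CDF at s² has the closed form Q(2, β/s²) = (1 + β/s²) e^{−β/s²}, which makes the approximation easy to check.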

Since the mean of IΓ is β/(α − 1), the expected variance of μ is β/ν for α = 2. To ensure proper mixing, we could simply set β/ν to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that the data contained all states in almost equal number. However, due to possible class imbalance, means for short segments far away from μ₀ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define δ to be the sample variance of block means in the compression obtained by σ²_MAD, and take the maximum of those two variances. We thus obtain

\[ \mu_0 = \frac{\Sigma_1}{n} \qquad \text{and} \qquad \nu = \beta \left[\max\!\left(\frac{n\Sigma_2 - \Sigma_1^2}{n^2},\; \delta\right)\right]^{-1} \]
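In code, these hyperparameters come directly from the root's sufficient statistics (a short illustrative sketch; variable names mirror the formulas above):

```python
def auto_prior_location(n, S1, S2, beta, delta):
    """Compute mu_0 and nu from the root's sufficient statistics n,
    S1 = sum(y) and S2 = sum(y^2), guarding the sample variance of the data
    from below by delta (the variance of the MAD-compressed block means)."""
    mu0 = S1 / n
    sample_var = (n * S2 - S1 ** 2) / n ** 2
    nu = beta / max(sample_var, delta)
    return mu0, nu
```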

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within the block can be written as

\[ \frac{A_{jj}^{n_w-1}}{(2\pi)^{n_w/2}\,\sigma_j^{n_w}}\, \exp\left(-\sum_{k=1}^{n_w}\frac{(y_{w,k}-\mu_j)^2}{2\sigma_j^2}\right) \]

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

\[ \exp\left(-\sum_{k=1}^{n_w}\frac{(y_{w,k}-\mu_j)^2}{2\sigma_j^2} + (n_w - 1)\log A_{jj} - n_w \log\sigma_j\right) \]

Using the block's sufficient statistics

\[ n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2, \]

the exponent can be rewritten as

\[ E_w(j) = \frac{2\mu_j\Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j), \]

\[ K(n_w, j) = (n_w - 1)\log A_{jj} - n_w\left(\log\sigma_j + \frac{\mu_j^2}{2\sigma_j^2}\right) \]

Note that K(n_w, j) can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, hence E_w(j) ≤ 0. This is equivalent to dividing the forward variables by

\[ \exp\left(\max_k E_w(k)\right), \]

which cancels out during normalization. Hence, we obtain

\[ \alpha_w(j) = \exp\left(E_w(j) - \max_k E_w(k)\right)\,\sum_i \alpha_{w-1}(i)\, A_{ij}, \]

which are then normalized to

\[ \hat{\alpha}_w(j) = \frac{\alpha_w(j)}{\sum_k \alpha_w(k)}. \]
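Put together, one forward step over a compressed block looks like this (an illustrative NumPy sketch of the scaled recursion above, not the authors' C++ implementation):

```python
import numpy as np

def forward_block_update(alpha_prev, A, mu, sigma, n_w, S1, S2):
    """One scaled forward step over a block of length n_w with sufficient
    statistics S1 = sum(y) and S2 = sum(y^2).

    alpha_prev -- normalized forward variables of the previous block
    A          -- state transition matrix
    mu, sigma  -- per-state emission means and standard deviations
    """
    # K(n_w, j): precomputable per state once mu, sigma and A are fixed
    K = (n_w - 1) * np.log(np.diag(A)) - n_w * (np.log(sigma) + mu ** 2 / (2 * sigma ** 2))
    E = (2 * mu * S1 - S2) / (2 * sigma ** 2) + K
    alpha = np.exp(E - E.max()) * (alpha_prev @ A)  # subtract the max to avoid overflow
    return alpha / alpha.sum()                      # normalize
```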

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

Authorsrsquo contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program with support from NSF award 1263082.

References

[1] Iafrate AJ Feuk L Rivera MN Listewnik MLDonahoe PK Qi Y Scherer SW Lee C Detectionof large-scale variation in the human genome NatureGenetics 36(9) 949ndash51 (2004) doi101038ng1416

[2] Feuk L Marshall CR Wintle RF Scherer SWStructural variants changing the landscape of chro-mosomes and design of disease studies HumanMolecular Genetics 15 Spec No 57ndash66 (2006)doi101093hmgddl057

[3] McCarroll SA Altshuler DM Copy-numbervariation and association studies of human dis-ease Nature Genetics 39(7 Suppl) 37ndash42 (2007)doi101038ng2080

[4] Wain LV Armour JAL Tobin MD Genomic copynumber variation human health and disease Lancet374(9686) 340ndash350 (2009) doi101016S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148

bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

[20] Cleveland WS: Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA Copy Number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A New Smoothing Model for Analyzing Array CGH Data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), 17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, 1–1 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics 37(6), 1554–1563 (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), 6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L²(R). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series. Journal of the American Statistical Association (2000). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5



following method to specify hyperparameters of a weak prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse Gamma distribution, $(\mu, \sigma^2) \sim \mathrm{NI\Gamma}(\mu_0, \nu, \alpha, \beta)$, whose marginals simply separate into a Normal and an Inverse Gamma distribution:

$$\sigma^2 \sim \mathrm{I\Gamma}(\alpha, \beta), \qquad \mu \sim \mathrm{N}\!\left(\mu_0, \frac{\sigma^2}{\nu}\right).$$
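Because the marginals separate, a draw from this prior can be taken in two stages. The sketch below (Python standard library only; the function name is ours, not part of the paper's implementation) illustrates this, using the fact that if $G \sim \Gamma(\alpha, 1)$ then $\beta / G \sim \mathrm{I\Gamma}(\alpha, \beta)$:

```python
import math
import random

def draw_normal_inverse_gamma(mu0, nu, alpha, beta, rng):
    """Draw (mu, sigma^2) ~ NIGamma(mu0, nu, alpha, beta) via its marginals:
    sigma^2 ~ IGamma(alpha, beta), then mu ~ N(mu0, sigma^2 / nu)."""
    # If G ~ Gamma(alpha, scale=1), then beta / G ~ IGamma(alpha, beta).
    sigma2 = beta / rng.gammavariate(alpha, 1.0)
    mu = rng.gauss(mu0, math.sqrt(sigma2 / nu))
    return mu, sigma2

rng = random.Random(42)
mu, sigma2 = draw_normal_inverse_gamma(0.0, 1.0, 2.0, 0.5, rng)
```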

Let $s^2$ be a user-defined variance (or automatically inferred, e.g. from the largest of the finest detail coefficients, or $\hat\sigma^2_{\mathrm{MAD}}$), and $p$ the desired probability to sample a variance not larger than $s^2$. From the CDF of $\mathrm{I\Gamma}$ we obtain

$$p = P(\sigma^2 \le s^2) = \frac{\Gamma\!\left(\alpha, \frac{\beta}{s^2}\right)}{\Gamma(\alpha)} = Q\!\left(\alpha, \frac{\beta}{s^2}\right).$$

$\mathrm{I\Gamma}$ has a mean for $\alpha > 1$ and closed-form solutions for $\alpha \in \mathbb{N}$. Furthermore, $\mathrm{I\Gamma}$ has positive skewness for $\alpha > 3$. We thus let $\alpha = 2$, which yields

$$\beta = -s^2 \left( W_{-1}\!\left(-\frac{p}{e}\right) + 1 \right), \qquad 0 < p \le 1,$$

where $W_{-1}$ is the negative branch of the Lambert $W$-function, which is transcendental. However, an excellent analytical approximation with a maximum error of 0.025% is given in [87], which yields

$$\beta \approx s^2 \left( \frac{2\sqrt{b}}{M_1 \sqrt{b} + \sqrt{2}\left(M_2\, b\, e^{M_3 \sqrt{b}} + 1\right)} + b \right), \qquad b = -\ln p,$$

$$M_1 = 0.3361, \qquad M_2 = -0.0042, \qquad M_3 = -0.0201.$$
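For $\alpha = 2$, the CDF identity simplifies to $Q(2, x) = (1 + x)e^{-x}$ with $x = \beta / s^2$, which gives a direct way to sanity-check the approximation. A minimal Python sketch (function name ours) computes $\beta$ from $s^2$ and $p$ and recovers $p$ from the result:

```python
import math

M1, M2, M3 = 0.3361, -0.0042, -0.0201

def beta_hyperparameter(s2, p):
    """Approximate beta for IGamma(alpha=2, beta) such that
    P(sigma^2 <= s2) = p, using the Lambert-W approximation of [87]."""
    b = -math.log(p)
    w = 2.0 * math.sqrt(b) / (
        M1 * math.sqrt(b)
        + math.sqrt(2.0) * (M2 * b * math.exp(M3 * math.sqrt(b)) + 1.0)
    )
    return s2 * (w + b)

# Recover p via Q(2, x) = (1 + x) * exp(-x) with x = beta / s2:
beta = beta_hyperparameter(1.0, 0.9)
x = beta / 1.0
print((1.0 + x) * math.exp(-x))  # close to 0.9
```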

Since the mean of $\mathrm{I\Gamma}$ is $\frac{\beta}{\alpha - 1}$, the expected variance of $\mu$ is $\frac{\beta}{\nu}$ for $\alpha = 2$. To ensure proper mixing, we could simply set $\frac{\beta}{\nu}$ to the sample variance of the data, which can be estimated from the sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that all states were represented in the data in almost equal numbers. However, due to possible class imbalance, means for short segments far away from $\mu_0$ can have low sampling probability, as they do not contribute much to the sample variance of the data. We thus define $\delta$ to be the sample variance of block means in the compression obtained by $\hat\sigma^2_{\mathrm{MAD}}$, and take the maximum of those two variances. We thus obtain

$$\mu_0 = \frac{\Sigma_1}{n} \qquad \text{and} \qquad \nu = \beta \left( \max\left\{ \frac{n\Sigma_2 - \Sigma_1^2}{n^2},\ \delta \right\} \right)^{-1}.$$
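Both quantities are cheap to evaluate once the sufficient statistics are available; a small sketch (names ours), assuming $\Sigma_1$, $\Sigma_2$, $n$, $\beta$ and $\delta$ are given:

```python
def auto_prior_location(S1, S2, n, beta, delta):
    """Compute mu0 and nu from the sufficient statistics S1 = sum(y) and
    S2 = sum(y^2), so that the expected variance of mu, beta / nu,
    equals max(sample variance, delta)."""
    mu0 = S1 / n
    sample_var = (n * S2 - S1 ** 2) / n ** 2
    nu = beta / max(sample_var, delta)
    return mu0, nu
```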

Numerical issues

To assure numerical stability, many HMM implementations resort to log-space computations. Our implementation, which differs from [62], reflects the block structure and avoids a considerable number of expensive function calls (exp, log, pow) required by others. This yields an additional speedup of about one order of magnitude compared to the log version (data not shown). The term accounting for emissions and self-transitions within a block can be written as

$$\frac{A_{jj}^{n_w - 1}}{(2\pi)^{n_w/2}\, \sigma_j^{n_w}} \exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} \right).$$

Any constant cancels out during normalization. Furthermore, exponentiation of potentially small numbers causes underflows. We hence move those terms into the exponent, utilizing the much more stable logarithm function:

$$\exp\left( -\sum_{k=1}^{n_w} \frac{(y_{w,k} - \mu_j)^2}{2\sigma_j^2} + (n_w - 1) \log A_{jj} - n_w \log \sigma_j \right).$$

Using the block's sufficient statistics

$$n_w, \qquad \Sigma_1 = \sum_{k=1}^{n_w} y_{w,k}, \qquad \Sigma_2 = \sum_{k=1}^{n_w} y_{w,k}^2,$$

the exponent can be rewritten as

$$E_w(j) = \frac{2\mu_j \Sigma_1 - \Sigma_2}{2\sigma_j^2} + K(n_w, j),$$

$$K(n_w, j) = (n_w - 1) \log A_{jj} - n_w \left( \log \sigma_j + \frac{\mu_j^2}{2\sigma_j^2} \right).$$

Note that $K(n_w, j)$ can be precomputed for each iteration, thus greatly reducing the number of expensive function calls. To avoid overflow of the exponential function, we subtract the largest such exponent among all states, so that all shifted exponents are at most zero. This is equivalent to dividing the forward variables by

$$\exp\left( \max_k E_w(k) \right),$$

which cancels out during normalization. Hence we obtain

$$\tilde\alpha_w(j) = \exp\left( E_w(j) - \max_k E_w(k) \right) \sum_i \alpha_{w-1}(i)\, A_{ij},$$

which are then normalized to

$$\alpha_w(j) = \frac{\tilde\alpha_w(j)}{\sum_k \tilde\alpha_w(k)}.$$
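Putting these pieces together, one compressed forward step can be sketched as follows (Python, our own naming; for brevity, $K$ is computed inline here rather than precomputed once per iteration as described above):

```python
import math

def forward_step(alpha_prev, A, mu, sigma, n_w, S1, S2):
    """One forward step over a block of n_w observations, using the
    sufficient statistics S1 = sum(y), S2 = sum(y^2) and the
    max-subtraction trick to avoid overflow/underflow."""
    m = len(mu)  # number of states
    E = []
    for j in range(m):
        K = (n_w - 1) * math.log(A[j][j]) \
            - n_w * (math.log(sigma[j]) + mu[j] ** 2 / (2 * sigma[j] ** 2))
        E.append((2 * mu[j] * S1 - S2) / (2 * sigma[j] ** 2) + K)
    E_max = max(E)  # subtract the largest exponent among all states
    unnorm = [
        math.exp(E[j] - E_max) * sum(alpha_prev[i] * A[i][j] for i in range(m))
        for j in range(m)
    ]
    Z = sum(unnorm)
    return [a / Z for a in unnorm]
```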

Availability of supporting data

The implementation as well as the supplemental data are available at http://bioinformatics.rutgers.edu/Software/HaMMLET/.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

14

CC-BY-ND 40 International licensecertified by peer review) is the authorfunder It is made available under aThe copyright holder for this preprint (which was notthis version posted July 31 2015 httpsdoiorg101101023705doi bioRxiv preprint

Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program, with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No 1, R57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X


[56] Forney GD The Viterbi algorithm Proceed-ings of the IEEE 61(3) 268ndash278 (1973)doi101109PROC19739030

17

CC-BY-ND 40 International licensecertified by peer review) is the authorfunder It is made available under aThe copyright holder for this preprint (which was notthis version posted July 31 2015 httpsdoiorg101101023705doi bioRxiv preprint

[57] Chib S Calculating posterior distributions and modalestimates in Markov mixture models Journal of Econo-metrics 75(1) 79ndash97 (1996)

[58] Scott SL Bayesian Methods for Hidden Markov Mod-els Recursive Computing in the 21st Century Journalof the American Statistical Association 97(457) 337ndash351 (2002)

[59] Guha S Li Y Neuberg D Bayesian Hid-den Markov Modeling of Array CGH Data(2006) httpbiostatsbepresscomharvardbiostatpaper24

[60] Shah SP Xuan X DeLeeuw RJ Khojasteh MLam WL Ng R Murphy KP Integrating copy num-ber polymorphisms into array CGH analysis using arobust HMM Bioinformatics 22(14) 431ndash9 (2006)doi101093bioinformaticsbtl238

[61] Shah SP Lam WL Ng RT Murphy KP Mod-eling recurrent DNA copy number alterations in ar-ray CGH data Bioinformatics 23(13) 450ndash8 (2007)doi101093bioinformaticsbtm221

[62] Mahmud MP Schliep A Fast MCMC samplingfor Hidden Markov Models to determine copy num-ber variations BMC Bioinformatics 12 428 (2011)doi1011861471-2105-12-428

[63] Ben-Yaacov E Eldar YC A fast and flexi-ble method for the segmentation of aCGHdata Bioinformatics 24(16) 139ndash45 (2008)doi101093bioinformaticsbtn272

[64] Wang J Meza-Zepeda LA Kresse SH Mykle-bost O M-CGH analysing microarray-based CGHexperiments BMC Bioinformatics 5(1) 74 (2004)doi1011861471-2105-5-74

[65] Lai WR Johnson MD Kucherlapati R ParkPJ Comparative analysis of algorithms for iden-tifying amplifications and deletions in array CGHdata Bioinformatics 21(19) 3763ndash3770 (2005)doi101093bioinformaticsbti611

[66] Willenbrock H Fridlyand J A comparison studyapplying segmentation to array CGH data for down-stream analyses Bioinformatics 21(22) 4084ndash91(2005) doi101093bioinformaticsbti677

[67] Hodgson G Hager JH Volik S Hariono S Wer-nick M Moore D Nowak N Albertson DGPinkel D Collins C Hanahan D Gray JWGenome scanning with array CGH delineates regionalalterations in mouse islet carcinomas Nature Genetics29(4) 459ndash64 (2001) doi101038ng771

[68] Edgren H Murumagi A Kangaspeska S NicoriciD Hongisto V Kleivi K Rye IH Nyberg S Wolf

M Boslashrresen-Dale A-L Kallioniemi O Identifica-tion of fusion genes in breast cancer by paired-endRNA-sequencing Genome Biology 12(1) 6 (2011)doi101186gb-2011-12-1-r6

[69] Burdall S Hanby A Lansdown M Speirs V Breastcancer cell lines friend or foe Breast Cancer Re-search 5(2) 89ndash95 (2003) doi101186bcr577

[70] Holliday DL Speirs V Choosing the right cell linefor breast cancer research Breast Cancer Research13(4) 215 (2011) doi101186bcr2889

[71] Oumlzguumlr A Oumlzguumlr L Guumlngoumlr T Text Categorizationwith Class-Based and Corpus-Based Keyword Selec-tion Proceedings of the 20th International Confer-ence on Computer and Information Sciences 3733606ndash615 (2005) doi10100711569596

[72] Snijders AM Nowak N Segraves R BlackwoodS Brown N Conroy J Hamilton G Hindle AKHuey B Kimura K Law S Myambo K Palmer JYlstra B Yue JP Gray JW Jain AN Pinkel DAlbertson DG Assembly of microarrays for genome-wide measurement of DNA copy number Nature Ge-netics 29(3) 263ndash4 (2001) doi101038ng754

[73] Mahmud MP Schliep A Speeding up BayesianHMM by the four Russians method In Pro-ceedings of the 11th International Conferenceon Algorithms in Bioinformatics pp 188ndash200(2011) httpdlacmorgcitationcfmid=20399452039962

[74] Daubechies I Ten Lectures on Wavelets vol 61p 357 (1992) doi10113719781611970104httpepubssiamorgdoibook10113719781611970104

[75] Mallat SG A Wavelet Tour of Signal Process-ing The Sparse Way Academic Press Burling-ton MA (2009) httpdlacmorgcitationcfmid=1525499

[76] Haar A Zur Theorie der orthogonalen Funktionen-systeme Mathematische Annalen 69(3) 331ndash371(1910) doi101007BF01456326

[77] Mallat SG A theory for multiresolution signal de-composition the wavelet representation IEEE Trans-actions on Pattern Analysis and Machine Intelligence11(7) 674ndash693 (1989) doi10110934192463

[78] Mallat SG Multiresolution approximations andwavelet orthonormal bases of L2(R) Transactions ofthe American Mathematical Society 315(1) 69ndash69(1989) doi101090S0002-9947-1989-1008470-5

[79] Donoho DL Johnstone IM Ideal spatial adapta-tion by wavelet shrinkage Biometrika 81(3) 425ndash455(1994) doi101093biomet813425

18

CC-BY-ND 40 International licensecertified by peer review) is the authorfunder It is made available under aThe copyright holder for this preprint (which was notthis version posted July 31 2015 httpsdoiorg101101023705doi bioRxiv preprint

[80] Donoho DL Johnstone IM Asymptotic minimax-ity of wavelet estimators with sampled data StatisticaSinica 9 1ndash32 (1999)

[81] Donoho DL Johnstone IM Minimax estimationvia wavelet shrinkage The Annals of Statistics 26(3)879ndash921 (1998)

[82] Donoho DL Johnstone IM Threshold se-lection for wavelet shrinkage of noisy data InProceedings of 16th Annual International Con-ference of the IEEE Engineering in Medicineand Biology Society pp 24ndash25 IEEE BaltimoreMD (1994) doi101109IEMBS1994412133httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=412133

[83] Donoho DL Johnstone IM Kerkyacharian G Pi-card D Wavelet Shrinkage Asymptopia Journalof the Royal Statistical Society Series B (StatisticalMethodology) 57(2) 301ndash369 (1995)

[84] Percival DB Mofjeld HO Analysis of SubtidalCoastal Sea Level Fluctuations Using Wavelets Jour-nal of the American Statistical Association 92(439)868 (1997) doi1023072965551

[85] Serroukh A Walden AT Percival DB Statisti-cal Properties and Uses of the Wavelet Variance Esti-mator for the Scale Analysis of Time Series (2012)doi10108001621459200010473913

[86] Luo J Schumacher M Scherer A Sanoudou DMegherbi D Davison T Shi T Tong W Shi LHong H Zhao C Elloumi F Shi W Thomas RLin S Tillinghast G Liu G Zhou Y HermanD Li Y Deng Y Fang H Bushel P Woods MZhang J A comparison of batch effect removal meth-ods for enhancement of prediction performance us-ing MAQC-II microarray gene expression data ThePharmacogenomics Journal 10(4) 278ndash91 (2010)doi101038tpj201057

[87] Barry DA Parlange J-Y Li L Prommer H Cun-ningham CJ Stagnitti F Analytical approximationsfor real values of the Lambert W-function Mathemat-ics and Computers in Simulation 53(1-2) 95ndash103(2000) doi101016S0378-4754(00)00172-5

19

CC-BY-ND 40 International licensecertified by peer review) is the authorfunder It is made available under aThe copyright holder for this preprint (which was notthis version posted July 31 2015 httpsdoiorg101101023705doi bioRxiv preprint

Page 15: Fast Bayesian Inference of Copy Number Variants using ... · 7/31/2015  · Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression John

Authors' contributions

JW and AS conceived and supervised the study and wrote the paper. EB and JW developed the implementation. All authors discussed, read and approved the article.

Acknowledgements

We would like to thank Md Pavel Mahmud for helpful advice. E. Brugel participated in the DIMACS REU program with support from NSF award 1263082.

References

[1] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nature Genetics 36(9), 949–51 (2004). doi:10.1038/ng1416

[2] Feuk L, Marshall CR, Wintle RF, Scherer SW: Structural variants: changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics 15 Spec No, 57–66 (2006). doi:10.1093/hmg/ddl057

[3] McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature Genetics 39(7 Suppl), 37–42 (2007). doi:10.1038/ng2080

[4] Wain LV, Armour JAL, Tobin MD: Genomic copy number variation, human health, and disease. Lancet 374(9686), 340–350 (2009). doi:10.1016/S0140-6736(09)60249-X

[5] Beckmann JS, Sharp AJ, Antonarakis SE: CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenetic and Genome Research 123(1-4), 7–16 (2008). doi:10.1159/000184687

[6] Cook EH, Scherer SW: Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–23 (2008). doi:10.1038/nature07458

[7] Buretic-Tomljanovic A, Tomljanovic D: Human genome variation in health and in neuropsychiatric disorders. Psychiatria Danubina 21(4), 562–9 (2009)

[8] Merikangas AK, Corvin AP, Gallagher L: Copy-number variants in neurodevelopmental disorders: promises and challenges. Trends in Genetics 25(12), 536–44 (2009). doi:10.1016/j.tig.2009.10.006

[9] Cho EK, Tchinda J, Freeman JL, Chung Y-J, Cai WW, Lee C: Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research 115(3-4), 262–72 (2006). doi:10.1159/000095923

[10] Shlien A, Malkin D: Copy number variations and cancer susceptibility. Current Opinion in Oncology 22(1), 55–63 (2010). doi:10.1097/CCO.0b013e328333dca4

[11] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature Reviews Genetics 7(2), 85–97 (2006). doi:10.1038/nrg1767

[12] Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8(8), 639–46 (2007). doi:10.1038/nrg2149

[13] Sharp AJ: Emerging themes and new challenges in defining the role of structural variation in human disease. Human Mutation 30(2), 135–44 (2009). doi:10.1002/humu.20843

[14] Garnis C, Coe BP, Zhang L, Rosin MP, Lam WL: Overexpression of LRP12, a gene contained within an 8q22 amplicon identified by high-resolution array CGH analysis of oral squamous cell carcinomas. Oncogene 23(14), 2582–6 (2004). doi:10.1038/sj.onc.1207367

[15] Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nature Genetics 25(2), 144–6 (2000). doi:10.1038/75985

[16] Veltman JA, Fridlyand J, Pejavar S, Olshen AB, Korkola JE, DeVries S, Carroll P, Kuo W-L, Pinkel D, Albertson DG, Cordon-Cardo C, Jain AN, Waldman FM: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Research 63(11), 2872–80 (2003)

[17] Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19(13), 1714–1715 (2003). doi:10.1093/bioinformatics/btg230

[18] Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America 99(20), 12963–8 (2002). doi:10.1073/pnas.162471999

[19] Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 21(7), 1146–53 (2005). doi:10.1093/bioinformatics/bti148


bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

[20] Cleveland WS: Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 74(368) (1979). doi:10.1080/01621459.1979.10481038

[21] Beheshti B, Braude I, Marrano P, Thorner P, Zielenska M, Squire JA: Chromosomal localization of DNA amplifications in neuroblastoma tumors using cDNA microarray comparative genomic hybridization. Neoplasia 5(1), 53–62 (2003)

[22] Polzehl J, Spokoiny VG: Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 335–354 (2000). doi:10.1111/1467-9868.00235

[23] Hupé P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20(18), 3413–22 (2004). doi:10.1093/bioinformatics/bth418

[24] Wang Y, Wang S: A novel stationary wavelet denoising algorithm for array-based DNA Copy Number data. International Journal of Bioinformatics Research and Applications 3(x), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (Oxford, England) 6(2), 211–26 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A New Smoothing Model for Analyzing Array CGH Data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–52 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9 Suppl 2, S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, 1–1 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (Oxford, England) 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–63 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–43 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics (Oxford, England) 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics (Oxford, England) 8(3), 632–53 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-CC, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-CC: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), 89 (2011). doi:10.1093/nar/gkr137


[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9. http://link.springer.com/10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains (1966). doi:10.1214/aoms/1177699147. http://www.jstor.org/stable/2238772. Accessed 2015-02-14

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–9 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–8 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–9 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–6 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), 41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), 431–9 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), 450–8 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), 139–45 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–91 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–64 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–4 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011). http://dl.acm.org/citation.cfm?id=2039945.2039962

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104. http://epubs.siam.org/doi/book/10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009). http://dl.acm.org/citation.cfm?id=1525499

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L2(R). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series (2012). doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–91 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5


Page 16: Fast Bayesian Inference of Copy Number Variants using ... · 7/31/2015  · Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression John

[20] Cleveland WS Robust Locally Weighted Regres-sion and Smoothing Scatterplots Journal of theAmerican Statistical Association 74(368) (1979)doi10108001621459197910481038

[21] Beheshti B Braude I Marrano P Thorner P Zie-lenska M Squire JA Chromosomal localization ofDNA amplifications in neuroblastoma tumors usingcDNA microarray comparative genomic hybridizationNeoplasia 5(1) 53ndash62 (2003)

[22] Polzehl J Spokoiny VG Adaptive weights smooth-ing with applications to image restoration Jour-nal of the Royal Statistical Society Series B(Statistical Methodology) 62(2) 335ndash354 (2000)doi1011111467-986800235

[23] Hupeacute P Stransky N Thiery J-P Radvanyi F Baril-lot E Analysis of array CGH data from signal ratio togain and loss of DNA regions Bioinformatics 20(18)3413ndash22 (2004) doi101093bioinformaticsbth418

[24] Wang Y, Wang S: A novel stationary wavelet de-noising algorithm for array-based DNA Copy Number data. International Journal of Bioinformatics Research and Applications 3(2), 206–222 (2007). doi:10.1504/IJBRA.2007.013603

[25] Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics 6(2), 211–226 (2005). doi:10.1093/biostatistics/kxi004

[26] Nguyen N, Huang H, Oraintara S, Vo A: A New Smoothing Model for Analyzing Array CGH Data. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, Boston, MA (2007). doi:10.1109/BIBE.2007.4375683

[27] Nguyen N, Huang H, Oraintara S, Vo A: Stationary wavelet packet transform and dependent laplacian bivariate shrinkage estimator for array-CGH data smoothing. Journal of Computational Biology 17(2), 139–152 (2010). doi:10.1089/cmb.2009.0013

[28] Huang H, Nguyen N, Oraintara S, Vo A: Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain. BMC Genomics 9(Suppl 2), S17 (2008). doi:10.1186/1471-2164-9-S2-S17

[29] Holt C, Losic B, Pai D, Zhao Z, Trinh Q, Syam S, Arshadi N, Jang GH, Ali J, Beck T, McPherson J, Muthuswamy LB: WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing. Bioinformatics (2013). doi:10.1093/bioinformatics/btt611

[30] Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33(11), 3455–3464 (2005). doi:10.1093/nar/gki643

[31] Tsourakakis CE, Peng R, Tsiarli MA, Miller GL, Schwartz R: Approximation algorithms for speeding up dynamic programming and denoising aCGH data. Journal of Experimental Algorithmics 16, Article 1.1 (2011). doi:10.1145/1963190.2063517

[32] Olshen AB, Venkatraman ES: Change-point analysis of array-based comparative genomic hybridization data. ASA Proceedings of the Joint Statistical Meetings, 2530–2535 (2002)

[33] Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5(4), 557–572 (2004). doi:10.1093/biostatistics/kxh008

[34] Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23(6), 657–663 (2007). doi:10.1093/bioinformatics/btl646

[35] Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 6(1), 27 (2005). doi:10.1186/1471-2105-6-27

[36] Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 20(18), 3533–3543 (2004). doi:10.1093/bioinformatics/bth440

[37] Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics 6(1), 45–58 (2005). doi:10.1093/biostatistics/kxh017

[38] Xing B, Greenwood CMT, Bull SB: A hierarchical clustering method for estimating copy number variation. Biostatistics 8(3), 632–653 (2007). doi:10.1093/biostatistics/kxl035

[39] Chen C-H, Lee H-C, Ling Q, Chen H-R, Ko Y-A, Tsou T-S, Wang S-C, Wu L-C, Lee H-C: An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Research 39(13), e89 (2011). doi:10.1093/nar/gkr137

bioRxiv preprint doi: https://doi.org/10.1101/023705; this version posted July 31, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

[40] Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin Heidelberg (2003). doi:10.1007/3-540-36605-9

[41] Baum LE, Petrie T: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics (1966). doi:10.1214/aoms/1177699147

[42] Snijders AM, Fridlyand J, Mans DA, Segraves R, Jain AN, Pinkel D, Albertson DG: Shaping of tumor and drug-resistant genomes by instability and selection. Oncogene 22(28), 4370–4379 (2003). doi:10.1038/sj.onc.1206482

[43] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–528 (2004). doi:10.1126/science.1098918

[44] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y-H, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King M-C, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 316(5823), 445–449 (2007). doi:10.1126/science.1138659

[45] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 90(1), 132–153 (2004). doi:10.1016/j.jmva.2004.02.008

[46] Zhao X: An Integrated View of Copy Number and Allelic Alterations in the Cancer Genome Using Single Nucleotide Polymorphism Arrays. Cancer Research 64(9), 3060–3071 (2004). doi:10.1158/0008-5472.CAN-03-3308

[47] de Vries BBA, Pfundt R, Leisink M, Koolen DA, Vissers LELM, Janssen IM, van Reijmersdal S, Nillesen WM, Huys EHLPG, de Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CMA, van Kessel AG, Schoenmakers EFPM, Brunner HG, Veltman JA: Diagnostic genome profiling in mental retardation. American Journal of Human Genetics 77(4), 606–616 (2005). doi:10.1086/491719

[48] Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Research 65(14), 6071–6079 (2005). doi:10.1158/0008-5472.CAN-05-0465

[49] Marioni JC, Thorne NP, Tavaré S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 22(9), 1144–1146 (2006). doi:10.1093/bioinformatics/btl089

[50] Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB: Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America 104(24), 10110–10115 (2007). doi:10.1073/pnas.0703834104

[51] Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Research 36(7), e41 (2008). doi:10.1093/nar/gkn110

[52] Rueda OM, Diaz-Uriarte R: RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions. Bioinformatics 25(15), 1959–1960 (2009). doi:10.1093/bioinformatics/btp307

[53] Bilmes J: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998)

[54] Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 257–286 (1989). doi:10.1109/5.18626

[55] Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967). doi:10.1109/TIT.1967.1054010

[56] Forney GD: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973). doi:10.1109/PROC.1973.9030


[57] Chib S: Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75(1), 79–97 (1996)

[58] Scott SL: Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century. Journal of the American Statistical Association 97(457), 337–351 (2002)

[59] Guha S, Li Y, Neuberg D: Bayesian Hidden Markov Modeling of Array CGH Data (2006). http://biostats.bepress.com/harvardbiostat/paper24

[60] Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 22(14), e431–e439 (2006). doi:10.1093/bioinformatics/btl238

[61] Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23(13), i450–i458 (2007). doi:10.1093/bioinformatics/btm221

[62] Mahmud MP, Schliep A: Fast MCMC sampling for Hidden Markov Models to determine copy number variations. BMC Bioinformatics 12, 428 (2011). doi:10.1186/1471-2105-12-428

[63] Ben-Yaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics 24(16), i139–i145 (2008). doi:10.1093/bioinformatics/btn272

[64] Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 5(1), 74 (2004). doi:10.1186/1471-2105-5-74

[65] Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21(19), 3763–3770 (2005). doi:10.1093/bioinformatics/bti611

[66] Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–4091 (2005). doi:10.1093/bioinformatics/bti677

[67] Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Nowak N, Albertson DG, Pinkel D, Collins C, Hanahan D, Gray JW: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics 29(4), 459–464 (2001). doi:10.1038/ng771

[68] Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale A-L, Kallioniemi O: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biology 12(1), R6 (2011). doi:10.1186/gb-2011-12-1-r6

[69] Burdall S, Hanby A, Lansdown M, Speirs V: Breast cancer cell lines: friend or foe? Breast Cancer Research 5(2), 89–95 (2003). doi:10.1186/bcr577

[70] Holliday DL, Speirs V: Choosing the right cell line for breast cancer research. Breast Cancer Research 13(4), 215 (2011). doi:10.1186/bcr2889

[71] Özgür A, Özgür L, Güngör T: Text Categorization with Class-Based and Corpus-Based Keyword Selection. Proceedings of the 20th International Conference on Computer and Information Sciences 3733, 606–615 (2005). doi:10.1007/11569596

[72] Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29(3), 263–264 (2001). doi:10.1038/ng754

[73] Mahmud MP, Schliep A: Speeding up Bayesian HMM by the four Russians method. In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics, pp. 188–200 (2011)

[74] Daubechies I: Ten Lectures on Wavelets, vol. 61, p. 357 (1992). doi:10.1137/1.9781611970104

[75] Mallat SG: A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, Burlington, MA (2009)

[76] Haar A: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910). doi:10.1007/BF01456326

[77] Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989). doi:10.1109/34.192463

[78] Mallat SG: Multiresolution approximations and wavelet orthonormal bases of L2(R). Transactions of the American Mathematical Society 315(1), 69–87 (1989). doi:10.1090/S0002-9947-1989-1008470-5

[79] Donoho DL, Johnstone IM: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994). doi:10.1093/biomet/81.3.425


[80] Donoho DL, Johnstone IM: Asymptotic minimaxity of wavelet estimators with sampled data. Statistica Sinica 9, 1–32 (1999)

[81] Donoho DL, Johnstone IM: Minimax estimation via wavelet shrinkage. The Annals of Statistics 26(3), 879–921 (1998)

[82] Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data. In: Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 24–25. IEEE, Baltimore, MD (1994). doi:10.1109/IEMBS.1994.412133

[83] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D: Wavelet Shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B (Statistical Methodology) 57(2), 301–369 (1995)

[84] Percival DB, Mofjeld HO: Analysis of Subtidal Coastal Sea Level Fluctuations Using Wavelets. Journal of the American Statistical Association 92(439), 868 (1997). doi:10.2307/2965551

[85] Serroukh A, Walden AT, Percival DB: Statistical Properties and Uses of the Wavelet Variance Estimator for the Scale Analysis of Time Series. Journal of the American Statistical Association. doi:10.1080/01621459.2000.10473913

[86] Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, Shi T, Tong W, Shi L, Hong H, Zhao C, Elloumi F, Shi W, Thomas R, Lin S, Tillinghast G, Liu G, Zhou Y, Herman D, Li Y, Deng Y, Fang H, Bushel P, Woods M, Zhang J: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. The Pharmacogenomics Journal 10(4), 278–291 (2010). doi:10.1038/tpj.2010.57

[87] Barry DA, Parlange J-Y, Li L, Prommer H, Cunningham CJ, Stagnitti F: Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1-2), 95–103 (2000). doi:10.1016/S0378-4754(00)00172-5
