Machine Learning Applications in Computational Genomics
— Some new algorithms for understanding cancer genomes
Jian Ma
Computational Biology Department, School of Computer Science
TCTCTCAGAGGGCCCTGATGGAAGAATCCCCCTACCACCCTTCCAGGCTGACTTCTGTCTATTTCTCCTGCAGAGTGAGCTGGACTTGGAAAAGGGCTTGGAGATGAGAAAATGGGTCCTGTCGGGAATCCTGGCTAGCGAGGAGACTTACCTGAGCCACCTGGAGGCACTGCTGCTGGTGAGGAGGATTTAGGGAGCTGAGCAGGGCGGGATGGGGCAGGGTGACAGGGTTGGGGAGCCTCTTTGCCCTTAAGTCCCAGGTCAGCTGTCAGAGCCTGGGTGCAGCTCGCCATCCCTGGAGTGGATACCAGTGGAAGACTGAGTTGCCAAACCAAGCTGGTTTTAAAATTGTATTTGTTATGTGATTTAAAAATAAAAGTGCATATGTCAGGTAACCATGACTGTCTACTGCCATACAATGCACCTGACGGATGGCAGCCCCTCTCACCTGTGCTACCTCACTTGTGCCCTCTTCCAGCCCATGAAGCCTTTGAAAGCCGCTGCCACCACCTCTCAGCCGGTGCTGACGAGTCAGCAGATCGAGACCATCTTCTTCAAAGTGCCTGAGCTCTACGAGATCCACAAGGAGTTCTATGATGGGCTCTTCCCCCGCGTGCAGCAGTGGAGCCACCAGCAGCGGGTGGGCGACCTCTTCCAGAAGCTGGTGAGTAACCCAGGGCCGGTGCTGGGACTACAGGCGTGTACCACCACGTCCAGCTAATTTTTTGCATTTTTAGTAGAGACAGGGTTTTGCTATGTTGGCCAGGCTGGTCTCAAACTCCTAACCTCAAGTGATCCACCTGCCTCAGCCTCCCAAAGTACTGAGATTACAGGCGTGAGCCGCCATGCCCAGCCTTTTTTTTTTTTTTTCTAATTTATATTTATTTAGATAGTTATTTTTAAAAAGAGATGGGGACTTACTACGTTGTCCAGGCTGGAGTGCAGTGGCTATTCACAGGCGCAATTCCACTGCTCATCAGCACGGGAGTTTTGACCTCCTTCCTTTCCAACCTTGGCTGTTTCACTCCTTCTTAGGCAAACTGATGGTTCCCGACTCCTGGGAGGTCACCATATTGATGCCAAACTTAGTGTGTAGTGCACTACAGCCCAGAACTCCTGACTGAAGCCATCCTCCGGCCTCAGCCTTCCGCGTAGCTGGGGCTATAGGTGCACGCCACCACACCCTGTGTGTGGCTGGGACTACAGGTGCACGCCATCACACCCTGTGTGCGCCATCACACCCTGTGTGCACCATCACACCCTGTGTGCACACACTTTCCCTAAAGCAGGCTTCCTCCGCTGGGAAACAAGTCCTCTAGGGGCAGGTGTGGCCAGAGGCCAGGCCCCCCTCTAAGTGTGAAGAGCATGTGATTCCTTAAAAGCCCTTCCCCCAGCACTTCTGGACTACCGAGACACACAGCTCTGGCCTCGGGCCTCCCCTTGGCTGGTGCTGGGGGCTGAGTTTTCTGCTCTGAGGTGTGGCTTTCCTGTAGGGGGACCCCTCCCTCTGCCACCCTGTGCTGCAGACCCCCAGACTCCAGGCCAGAGCTAAGGCTTGAGGAACACAGAAGGCACTTAATTTGTTCCAGTTCTTGCTCCCTGGGGCTCTTTCCCCCATGGCCAGAGAGCAGGAGGCTGTATTTTGATACATGCTGCCCCCTCCATCTTTGAAGCCCCCCCACCCCCGTTTCTCCGTGTGTGTGTCAGCAGTTTTAAACCTAGTGGAGGGTGGTGGCTCGGGCTGGGCTCCGCGTCGGGCTGCCCCGCAGCTGCTCTTGGGCAGCCAGGGCCGCTGGGTGTGGGGCCGCCGGGAATGGCGGGCCCGGGTGAGGGCGGGCCCGGGTGAGGGCGGGGGCGGAGAGGCGAAGAAGCTGCAGGAAGGGAGGGTGACGAGGGGGAAGCGAAGGAAGGGGAAGAGGAAGGGAAAAGCGAGCGAGAGGGGCAAGGCGGAAGAGGAAGCAGGGCGGAAGGGAAGCCCGGGCCGCAGACGGCGAAGGAGG
CAGCGGGCCGGGGGCTGAGGCGGGAGCGAGGACACGCCCAAGAGAGGAAGCAGAGGGAGGCGGAAGCGTGGAGGAAGGGGCGAGAGGCATCATCAAAGGAGATGAGGGGAGCGTAGGGGCCGGGAAAGAGGCACAAGGAAGAAAGTATGGGAAGGAGGAATGGAGGGTCAGGGCTAGGCGGCGGGAGGGCGCCAGGCCGGGAAGAGTACAAGGACAAGGAGGTCAGGTTTGGGCCTACATCCCGGGGACAGGGGCGGCCATGGCGGCGGCAGCCAGGGAGGAGGAGGAGGAGGCGGCTCGGGAGTCAGCCGCCTGCCCGGCTGCGGGGCCAGCGCTCTGGCGCCTGCCGGAAGTGCTGCTGCTGCACATGTGCTCCTACCTCGACATGCGGGCCCTCGGCCGCCTGGCCCAGGTGTACCGCTGGCTGTGGCACTTCACCAACTGCGACCTGCTCCGGCGCCAGATAGCCTGGGCCTCGCTCAACTCCGGCTTCACGCGGCTCGGCACCAACCTGATGACCAGTGTCCCAGTGAAGGTGTCTCAGAACTGGATAGTGGGGTGCTGCCGAGAGGGGATTCTGCTGAAGTGGAGATGCAGTCAGATGCCCTGGATGCAGCTAGAGGATGATGCTTTGTACATATCCCAGGCTAATTTCATCCTGGCCTACCAGTTCCGTCCAGATGGTGCCAGCTTGAACCGTCAGCCTCTGGGAGTCTGCTGGGCATGATGAGGACGTTTGCCACTTTGTGCTGGCCACCTCGCATATTGTCAGTGCAGGAGGAGATGGGAAGATTGGCCTTGGTAAGATTCACAGCACCTTCGCTGCCAAGTACTGGGCTCATGAACAGGAGGTGAACTGTGTGGATTGCAAAGGGGGCATCATATCATTGTGAGTGGCTCCAGGGACAGGACGGCCAAGGTGTGGCCTTTGGCCTCAGGCCAGCTGGGGTAGTGTTTATACACCATCCAGACTGAAGACCAAATCTGGTCTGTTGCTATC
Fundamental question: How do changes in genome sequences give rise to phenotypic differences (e.g., disease states)?
• When they got into the genome and how they have evolved
• Their roles in genome organization and gene regulation for human biology
• Their implications in human diseases such as cancer
Our goal — from base-pairs to bedside
Why Computational Genomics?
Why Computational Genomics?
• Key to personalized precision medicine, especially for cancer
David Patterson
• Cancer research has become big data science
• How to store and manage data efficiently
• How to analyze data in a distributed environment
• How to enhance data security but reduce barriers for sharing
• How to extract meaningful patterns
• How to identify mechanisms to help treatment
• …
The Human genome: the “blueprint” of our body
GTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATT
James Watson, Francis Crick
February 15, 2001
March, 2011
DNA, Chromosome, and Genome
244 Chapter 4: DNA, Chromosomes, and Genomes
Figure 4-72 Chromatin packing. This model shows some of the many levels of chromatin packing postulated to give rise to the highly condensed mitotic chromosome.
Levels shown in the figure:
• "beads-on-a-string" form of chromatin: 11 nm
• 30-nm chromatin fiber of packed nucleosomes: 30 nm
• section of chromosome in extended form: 300 nm
• condensed section of chromosome: 700 nm
• entire mitotic chromosome: 1400 nm
NET RESULT: EACH DNA MOLECULE HAS BEEN PACKAGED INTO A MITOTIC CHROMOSOME THAT IS 10,000-FOLD SHORTER THAN ITS EXTENDED LENGTH
Figure 4-73 The SMC proteins in condensins. (A) Electron micrographs of a purified SMC dimer. (B) The structure of a SMC dimer. The long central region of this protein is an antiparallel coiled-coil (see Figure 3-9) with a flexible hinge in its middle. (C) A model for the way in which the SMC proteins in condensins might compact chromatin. In reality, SMC proteins are components of a much larger condensin complex. It has been proposed that, in the cell, condensins coil long strings of looped chromatin domains (see Figure 4-57). In this way, the condensins could form a structural framework that maintains the DNA in a highly organized state during metaphase of the cell cycle. (A, courtesy of H.P. Erickson; B and C, adapted from T. Hirano, Nat. Rev. Mol. Cell Biol. 7:311-322, 2006. With permission from Macmillan Publishers Ltd.)
CHROMOSOMAL DNA AND ITS PACKAGING IN THE CHROMATIN FIBER
along each mitotic chromosome (Figure 4-11). The structural bases for these banding patterns are not well understood. Nevertheless, the pattern of bands on each type of chromosome is unique, and it is these patterns that initially allowed each human chromosome to be identified and numbered.
The display of the 46 human chromosomes at mitosis is called the human karyotype. If parts of chromosomes are lost or are switched between chromosomes, these changes can be detected by changes in the banding patterns or by changes in the pattern of chromosome painting (Figure 4-12). Cytogeneticists use these alterations to detect chromosome abnormalities that are associated with inherited defects, as well as to characterize cancers that are associated with specific chromosome rearrangements in somatic cells (discussed in Chapter 20).
Figure 4-10 The complete set of human chromosomes. These chromosomes, from a male, were isolated from a cell undergoing nuclear division (mitosis) and are therefore highly compacted. Each chromosome has been "painted" a different color to permit its unambiguous identification under the light microscope. Chromosome painting is performed by exposing the chromosomes to a collection of human DNA molecules that have been coupled to a combination of fluorescent dyes. For example, DNA molecules derived from chromosome 1 are labeled with one specific dye combination, those from chromosome 2 with another, and so on. Because the labeled DNA can form base pairs, or hybridize, only to the chromosome from which it was derived (discussed in Chapter 8), each chromosome is differently labeled. For such experiments, the chromosomes are subjected to treatments that separate the double-helical DNA into individual strands, designed to permit base-pairing with the single-stranded labeled DNA while keeping the chromosome structure relatively intact. (A) The chromosomes visualized as they originally spilled from the lysed cell. (B) The same chromosomes artificially lined up in their numerical order. This arrangement of the full chromosome set is called a karyotype. (From E. Schrock et al., Science 273:494-497, 1996. With permission from AAAS.)
Figure 4-11 The banding patterns of human chromosomes. Chromosomes 1-22 are numbered in approximate order of size. A typical human somatic (non-germ-line) cell contains two of each of these chromosomes, plus two sex chromosomes: two X chromosomes in a female, one X and one Y chromosome in a male. The chromosomes used to make these maps were stained at an early stage in mitosis, when the chromosomes are incompletely compacted. The horizontal red line represents the position of the centromere (see Figure 4-21), which appears as a constriction on mitotic chromosomes. The red knobs on chromosomes 13, 14, 15, 21, and 22 indicate the positions of genes that code for the large ribosomal RNAs (discussed in Chapter 6). These patterns are obtained by staining chromosomes with Giemsa stain, and they can be observed under the light microscope. (For micrographs, see Figure 21-18; adapted from U. Franke, Cytogenet. Cell Genet. 31:24-32, 1981. With
[Figure labels: DNA double helix; hydrogen-bonded base pairs]
4-4). This complementary base-pairing enables the base pairs to be packed in the energetically most favorable arrangement in the interior of the double helix. In this arrangement, each base pair is of similar width, thus holding the sugar-phosphate backbones an equal distance apart along the DNA molecule. To maximize the efficiency of base-pair packing, the two sugar-phosphate backbones
[Figure labels: building blocks of DNA; phosphate; sugar; base; double-stranded DNA]
Figure 4-3 DNA and its building blocks. DNA is made of four types of nucleotides, which are linked covalently into a polynucleotide chain (a DNA strand) with a sugar-phosphate backbone from which the bases (A, C, G, and T) extend. A DNA molecule is composed of two DNA strands held together by hydrogen bonds between the paired bases. The arrowheads at the ends of the DNA strands indicate the polarities of the two strands, which run antiparallel to each other in the DNA molecule. In the diagram at the bottom left of the figure, the DNA molecule is shown straightened out; in reality, it is twisted into a double helix, as shown on the right. For details, see Figure 4-5.
Figure 4-4 Complementary base pairs in the DNA double helix. The shapes and chemical structure of the bases allow hydrogen bonds to form efficiently only between A and T and between G and C, where atoms that are able to form hydrogen bonds (see Panel 2-3, pp. 110-111) can be brought close together without distorting the double helix. As indicated, two hydrogen bonds form between A and T, while three form between G and C. The bases can pair in this way only if the two polynucleotide chains that contain them are antiparallel to each other.
[Figure labels: adenine; thymine; guanine; cytosine; hydrogen bond; sugar-phosphate backbone; 3' and 5' strand ends]
DNA, RNA, Protein
• Central Dogma in molecular biology
  • DNA
  • RNA
  • Protein
• In general, proteins do most of the work, and are encoded by subsequences of DNA known as genes.
• However, less than 2% of the human genome codes for proteins.
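As a minimal sketch of the DNA, RNA, protein flow listed above, the snippet below transcribes a coding-strand DNA string into mRNA and translates it with the standard genetic code. The function and variable names are illustrative, not from the slides. The example string is the first nucleotide row of the DLX1 alignment shown later in this deck, and its translation matches the amino-acid line printed with that alignment.

```python
# Sketch of the central dogma: DNA -> RNA -> protein.
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each DNA codon (coding strand) to its one-letter amino acid.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    usable_length = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for pos in range(0, usable_length, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# 45 bases taken from the DLX1 alignment in these slides.
dlx1_segment = "AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAC"
mrna = transcribe(dlx1_segment)
print(mrna)
print(translate(mrna))  # KPRTIYSSLQLQALN
```

The sketch ignores splicing, reading-frame detection, and strand choice, all of which matter for real genes.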
Most of the genome is non-coding
© 2005 Nature Publishing Group
[Figure: components of the human genome]
  LINEs                              20.4%
  SINEs                              13.1%
  LTR retrotransposons                8.3%
  DNA transposons                     2.9%
  Simple sequence repeats             3%
  Segmental duplications              5%
  Miscellaneous heterochromatin       8%
  Miscellaneous unique sequences     11.6%
  Introns                            25.9%
  Protein-coding genes                1.5%
At least 40 different transposable-element families are represented by young, recently active elements in the pufferfish Takifugu rubripes (formerly known as Fugu rubripes), despite its genome being among the smallest in vertebrates. But even the most common type, the LINE element Maui, is present in only 6,400 copies32. In the second pufferfish to be sequenced, Tetraodon nigroviridis, only 4,000 transposable-element copies are found in total, but this still represents 73 different types of element33. Genomes that have a higher proportion of DNA transposons, such as in Drosophila melanogaster and Arabidopsis thaliana, contain elements of more recent origin that are derived from more families than in mammals16; this is explained by the fact that DNA transposons tend to be more short-lived and to spread by horizontal transfer. The genome of D. melanogaster, for example, contains about 130 different transposable-element families (including 25 non-LTR and 28 LTR families), all of which are younger than 20 million years (Myr) REF. 34.
There is therefore convincing evidence that many smaller genomes contain a surprisingly high diversity of transposable-element families. It also now seems that the diversity of lineages within individual transposable-element families might be higher in some smaller genomes. In mammals, the abundant LINE1 elements tend to be represented by a single lineage,
Box 3 | The main components of eukaryotic genomes
Protein-coding genes
Although most prokaryotic chromosomes consist almost entirely of protein-coding genes86, such elements make up a small fraction of most eukaryotic genomes (see figure). As a prime example, the human genome might contain as few as 20,000 genes, comprising less than 1.5% of the total genome sequence16,82.
Introns
Shortly after their discovery, the non-coding intervening sequences within coding genes (introns) were suggested to account for the pronounced discrepancy between gene number and genome size7. It has also recently been suggested that most non-coding DNA in animals (but not plants) is intronic, which would imply that most of the genome is transcribed even though protein-coding regions represent a tiny minority107,108. At the very least, introns were found to account for more than a quarter of the draft human sequence16. Over a broad taxonomic scale, intron size and genome size are positively correlated109, although within genera a correlation might (for example, Drosophila110) or might not (for example, Gossypium111) be observed.
Pseudogenes
Non-functional copies of coding genes, the original meaning of the term ‘junk DNA’, were once thought to explain variation in genome size4. However, it is now apparent that even in combination, ‘classical pseudogenes’ (direct DNA to DNA duplicates), ‘processed pseudogenes’ (copies that are reverse transcribed back into the genome from RNA and therefore lack introns) and ‘Numts’ (nuclear pseudogenes of mitochondrial origin) comprise a relatively small portion of mammalian genomes. The human genome is estimated to contain about 19,000 pseudogenes46.
Transposable elements
In eukaryotes, transposable elements are divided into two general classes according to their mode of transposition. Class I elements transpose through an RNA intermediate. This class comprises long interspersed nuclear elements (LINEs), endogenous retroviruses, short interspersed nuclear elements (SINEs) and long terminal repeat (LTR) retrotransposons. Class II elements transpose directly from DNA to DNA, and include DNA transposons and miniature inverted repeat transposable elements (MITEs).
Transposable elements (and especially their extinct remnants) make up a large portion of the human genome, with some elements (for example, the SINE Alu element) present in more than a million copies. Transposable-element evolution involves complex interactions with the host genome and other subgenomic elements, ranging from parasitism to mutualism. For a review of transposable-element structure, origins, impacts and evolution see REF. 17.
The figure provides a summary of the different components of the human genome. Less than 1.5% of the genome consists of the suspected 20,000–25,000 protein-coding sequences. By contrast, a large majority is made up of non-coding sequences such as introns (almost 26%) and (mostly defunct) transposable elements (nearly 45%). Data are taken from REF. 16.
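The percentages quoted in the box can be checked with a few lines of arithmetic. The sketch below is illustrative only: the pairing of each label with its percentage is a reading of the flattened figure labels (an assumption, though it agrees with the "almost 26%" and "nearly 45%" figures in the text), and the variable names are made up for this example.

```python
# Percent of the human genome per component, as read from the figure labels.
# The label-to-value pairing is an assumption reconstructed from the flattened figure.
GENOME_COMPONENTS = {
    "LINEs": 20.4,
    "SINEs": 13.1,
    "LTR retrotransposons": 8.3,
    "DNA transposons": 2.9,
    "Simple sequence repeats": 3.0,
    "Segmental duplications": 5.0,
    "Miscellaneous heterochromatin": 8.0,
    "Miscellaneous unique sequences": 11.6,
    "Introns": 25.9,
    "Protein-coding genes": 1.5,
}

TRANSPOSABLE_ELEMENTS = ("LINEs", "SINEs", "LTR retrotransposons", "DNA transposons")

# Share of the genome made up of transposable elements ("nearly 45%" in the text).
te_total = round(sum(GENOME_COMPONENTS[name] for name in TRANSPOSABLE_ELEMENTS), 1)

# The listed components do not sum to exactly 100% because of rounding in the figure.
listed_total = round(sum(GENOME_COMPONENTS.values()), 1)

# Everything in the figure that is not protein-coding sequence.
non_coding = round(listed_total - GENOME_COMPONENTS["Protein-coding genes"], 1)

print(f"Transposable elements: {te_total}%")
print(f"Listed components:     {listed_total}%")
print(f"Non-coding:            {non_coding}%")
```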
702 | SEPTEMBER 2005 | VOLUME 6 www.nature.com/reviews/genetics
Nat Rev Genet, 2005
Most functional information is non-coding
• 5% of the genome is highly conserved, but only 1.5% encodes proteins
[Figure: UCSC Genome Browser view of chr2 (q31.1), positions 172,660,000 to 172,675,000, showing DLX1 and DLX2 in the UCSC Genes track (based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics) above the Vertebrate Multiz Alignment & PhastCons Conservation track (28 species). Species rows: Human, Chimp, Rhesus, Bushbaby, Tree_shrew, Mouse, Rat, Guinea_Pig, Shrew, Hedgehog, Dog, Cat, Horse, Cow, Armadillo, Elephant, Tenrec, Opossum, Platypus, Lizard, Chicken, Zebrafish, Tetraodon, Fugu, Stickleback, Medaka.]
[Figure: base-level zoom into a DLX1 coding segment with the same tracks. Amino-acid line: K P R T I Y S S L Q L Q A L N. First nucleotide row: AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAC, followed by the aligned rows for the other species.]
What do they do?
Annotating the non-coding regions
[Figure: UCSC Genome Browser view (hg19) of chr2:20,090,000-20,105,000 (10 kb scale bar) around TTC32, showing the NKI LADs (Tig3) and LaminB1 (Tig3) tracks together with signal tracks for GM78 (CHD2 IgM, Pol2 IgM, Pol2 Std, Rad2 IgR, TBP IgM, Z274 Std) and K562 (CHD2 IgR, Pol2 IgM, IFa3 Pol2 Sd, IFa6 Pol2 Sd, IFg3 Pol2 Sd, IFg6 Pol2 Sd, Pol2 Std, Rad2 Std, TBP IgM, Z274 UCD).]
NATURE METHODS | VOL.9 NO.3 | MARCH 2012 | 215
CORRESPONDENCE
ChromHMM: automating chromatin-state discovery and characterization
To the Editor: Chromatin-state annotation using combinations of chromatin modification patterns has emerged as a powerful approach for discovering regulatory regions and their cell type–specific activity patterns and for interpreting disease-association studies1–5. However, the computational challenge of learning chromatin-state models from large numbers of chromatin modification datasets in multiple cell types still requires extensive bioinformatics expertise. To address this challenge, we developed ChromHMM, an automated computational system for learning chromatin states, characterizing their biological functions and correlations with large-scale functional datasets and visualizing the resulting genome-wide maps of chromatin-state annotations.
ChromHMM is based on a multivariate hidden Markov model that models the observed combination of chromatin marks using a product of independent Bernoulli random variables2, which enables robust learning of complex patterns of many chromatin modifications. As input, it receives a list of aligned reads for each chromatin mark, which are automatically converted into presence or absence calls for each mark across the genome, based on a Poisson background distribution. One can use an optional additional input of aligned reads for a control dataset to either adjust the threshold for present or absent calls, or as an additional input mark. Alternatively, the user can input files that contain calls from an independent peak caller. By default, chromatin states are analyzed at 200-base-pair intervals that roughly approximate nucleosome sizes, but smaller or larger windows can be specified. We also developed an improved parameter-initialization procedure that enables relatively efficient inference of comparable models across different numbers of states (Supplementary Note).
ChromHMM outputs both the learned chromatin-state model parameters and the chromatin-state assignments for each genomic position. The learned emission and transition parameters are returned in both text and image format (Fig. 1), automatically grouping chromatin states with similar emission parameters or proximal genomic locations, although a user-specified reordering can also be used (Supplementary Figs. 1–2 and Supplementary Note). ChromHMM enables the study of the likely biological roles of each chromatin state based on enrichment in diverse external annotations and experimental data, shown as heat maps and tables (Fig. 1), both for direct genomic overlap and at various distances from a chromatin state (Supplementary Fig. 3). ChromHMM also generates custom genome browser tracks6 that show the resulting chromatin-state segmentation in dense view (single color-coded track) or expanded view (each state shown separately) (Fig. 1). All the files ChromHMM produces by default are summarized on a webpage (Supplementary Data).
ChromHMM also enables the analysis of chromatin states across multiple cell types. When the chromatin marks are common across the cell types, a common model can be learned by a virtual ‘concatenation’ of the chromosomes of all cell types. Alternatively a model can be learned by a virtual ‘stacking’ of all marks across cell types, or independent models can be learned in each cell type. Lastly, ChromHMM supports the comparison of models with different number of chromatin states based on correlations in their emission parameters (Supplementary Fig. 4).
We wrote the software in Java, which allows it to be run on virtually any computer. ChromHMM and additional documentation is freely available at http://compbio.mit.edu/ChromHMM/.
Figure 1 | Sample outputs of ChromHMM. (a) Example of chromatin-state annotation tracks produced from ChromHMM and visualized in the UCSC genome browser6, including dense view (top; single track), expanded view (bottom; separate tracks). (b,c) Heat maps for model parameters (b) and for chromatin-state functional enrichments (c). The columns indicate the relative percentage of the genome represented by each chromatin state and relative fold enrichment for several types of annotation. CTCF, CCCTC-binding factor; WCE, whole-cell extract; TSS, transcription start site; TES, transcript end site; and GM12878 is a lymphoblastoid cell line. Data in this example correspond to a previous model learned across nine cell types3.
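The two modelling steps described in the correspondence, presence/absence calls against a Poisson background and emissions as a product of independent Bernoulli variables, can be sketched in a few lines. This is an illustrative approximation, not ChromHMM's code: the function names, the use of the mean bin count as the Poisson rate, and the 1e-4 tail cutoff are all assumptions made for this example.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k


def presence_threshold(lam, p_cutoff=1e-4):
    """Smallest read count k whose Poisson tail probability is below p_cutoff."""
    k = 0
    while poisson_tail(k, lam) >= p_cutoff:
        k += 1
    return k


def binarize(counts, p_cutoff=1e-4):
    """Turn per-bin read counts for one mark into 0/1 presence calls.

    Assumption: the background rate is the mean count per bin.
    """
    lam = sum(counts) / len(counts)
    k = presence_threshold(lam, p_cutoff)
    return [1 if count >= k else 0 for count in counts]


def emission_probability(observed, mark_probs):
    """P(observed marks | state) as a product of independent Bernoulli terms.

    observed:   0/1 call per mark in one bin
    mark_probs: per-mark probability of presence in this state
    """
    prob = 1.0
    for call, p in zip(observed, mark_probs):
        prob *= p if call == 1 else 1.0 - p
    return prob


# Hypothetical read counts for one mark in ten 200-bp bins.
counts = [1, 2, 3, 2, 1, 3, 2, 30, 2, 4]
print(binarize(counts))
# Hypothetical state in which mark 1 is usually present and mark 2 usually absent.
print(emission_probability([1, 0], [0.9, 0.2]))
```

With these hypothetical counts only the bin holding 30 reads is called present; the other bins sit within what the background rate would produce by chance.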
[Figure 1 panels:
(a) Chromatin-state tracks for GM12878 (user ordered) at chr4:103,650,000-103,750,000 (50 kb scale bar) over RefSeq genes NFKB1 and MANBA. States: 1_Active_Promoter, 2_Weak_Promoter, 3_Poised_Promoter, 4_Strong_Enhancer, 5_Strong_Enhancer, 6_Weak_Enhancer, 7_Weak_Enhancer, 8_Insulator, 9_Txn_Transition, 10_Txn_Elongation, 11_Weak_Txn, 12_Repressed, 13_Heterochrom/lo, 14_Repetitive/CNV, 15_Repetitive/CNV.
(b) Emission parameters (state, user order, by mark: CTCF, H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, WCE) and transition parameters (state from by state to, user order, states 1-15).
(c) GM12878 fold enrichments (state, user order, by category: Genome (%), RefSeq TSS, CpG island, RefSeq TSS 2 kb, RefSeq exon, RefSeq gene, RefSeq TES, Conserved, Lamina).]
ChromHMM — Ernst and Kellis, Nature Methods 2012
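To make the idea of per-position chromatin-state assignments concrete, the sketch below decodes the most likely state path of a small HMM with Bernoulli emissions. Viterbi decoding is used here as a generic, textbook stand-in; the source does not say which decoding rule ChromHMM applies, and all parameter values and names below are hypothetical.

```python
import math


def log_emission(observed, mark_probs):
    """Log-probability of one bin's 0/1 mark calls under independent Bernoullis."""
    return sum(math.log(p if call == 1 else 1.0 - p)
               for call, p in zip(observed, mark_probs))


def viterbi(observations, initial, transition, emission):
    """Most likely state sequence for a list of per-bin mark calls.

    All probabilities must lie strictly between 0 and 1 so the logs are finite.
    """
    n_states = len(initial)
    # score[s]: best log-probability of any path ending in state s at the current bin
    score = [math.log(initial[s]) + log_emission(observations[0], emission[s])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + math.log(transition[r][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + math.log(transition[best_prev][s])
                             + log_emission(obs, emission[s]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state, two-mark model: state 0 is "marked", state 1 is "unmarked".
initial = [0.5, 0.5]
transition = [[0.95, 0.05], [0.05, 0.95]]
emission = [[0.9, 0.9], [0.1, 0.1]]
bins = [[1, 1], [1, 1], [0, 0], [0, 0]]
print(viterbi(bins, initial, transition, emission))  # [0, 0, 1, 1]
```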
Cancer genomics workflow
Each type of cancer is different
divergence time. The number of mutations hasbeen measured in tumors representing progressivestages of colorectal and pancreatic cancers (11, 16).Applying the evolutionary clock model to thesedata leads to two unambiguous conclusions: First,it takes decades to develop a full-blown, meta-static cancer. Second, virtually all of themutationsin metastatic lesions were already present in alarge number of cells in the primary tumors.
The timing of mutations is relevant to ourunderstanding of metastasis, which is responsiblefor the death of most patients with cancer. Theprimary tumor can be surgically removed, but theresidual metastatic lesions—often undetectable andwidespread—remain and eventually enlarge, com-promising the function of the lungs, liver, or otherorgans. From a genetics perspective, it wouldseem that there must be mutations that convert aprimary cancer to a metastatic one, just as thereare mutations that convert a normal cell to a be-nign tumor, or a benign tumor to a malignant one(Fig. 2). Despite intensive effort, however, con-sistent genetic alterations that distinguish cancersthat metastasize from cancers that have not yetmetastasized remain to be identified.
One potential explanation invokes mutationsor epigenetic changes that are difficult to iden-tify with current technologies (see section on “darkmatter” below). Another explanation is that meta-static lesions have not yet been studied in suf-ficient detail to identify these genetic alterations,particularly if the mutations are heterogeneousin nature. But another possible explanation isthat there are no metastasis genes. A malignantprimary tumor can take many years to metasta-size, but this process is, in principle, explicableby stochastic processes alone (17, 18). Advancedtumors release millions of cells into the circula-tion each day, but these cells have short half-lives,and only a miniscule fraction establish metastaticlesions (19). Conceivably, these circulating cellsmay, in a nondeterministic manner, infrequentlyand randomly lodge in a capillary bed in an organthat provides a favorable microenvironment forgrowth. The bigger the primary tumor mass, themore likely that this process will occur. In thisscenario, the continual evolution of the primarytumor would reflect local selective advantagesrather than future selective advantages. The ideathat growth at metastatic sites is not dependent onadditional genetic alterations is also supported byrecent results showing that even normal cells,when placed in suitable environments such aslymph nodes, can grow into organoids, completewith a functioning vasculature (20).
Other Types of Genetic Alterations in Tumors
Though the rate of point mutations in tumors is similar to that of normal cells, the rate of chromosomal changes in cancer is elevated (21). Therefore, most solid tumors display widespread changes in chromosome number (aneuploidy), as well as deletions, inversions, translocations,
[Figure 1 graphic, panels A and B. Y-axis: non-synonymous mutations per tumor (median +/- one quartile), with ticks from 0 to 250 and from 500 to 1500. Tumor types are grouped as adult solid tumors, liquid, and pediatric, with a "Mutagens" annotation. Tumor types shown: Colorectal (MSI), Lung (SCLC), Lung (NSCLC), Melanoma, Esophageal (ESCC), Non-Hodgkin lymphoma, Colorectal (MSS), Head and neck, Esophageal (EAC), Gastric, Endometrial (endometrioid), Pancreatic adenocarcinoma, Ovarian (high-grade serous), Prostate, Hepatocellular, Glioblastoma, Breast, Endometrial (serous), Lung (never smoked NSCLC), Chronic lymphocytic leukemia, Acute myeloid leukemia, Glioblastoma, Neuroblastoma, Acute lymphoblastic leukemia, Medulloblastoma, Rhabdoid.]
Fig. 1. Number of somatic mutations in representative human cancers, detected by genome-wide sequencing studies. (A) The genomes of a diverse group of adult (right) and pediatric (left) cancers have been analyzed. Numbers in parentheses indicate the median number of nonsynonymous mutations per tumor. (B) The median number of nonsynonymous mutations per tumor in a variety of tumor types. Horizontal bars indicate the 25 and 75% quartiles. MSI, microsatellite instability; SCLC, small cell lung cancers; NSCLC, non–small cell lung cancers; ESCC, esophageal squamous cell carcinomas; MSS, microsatellite stable; EAC, esophageal adenocarcinomas. The published data on which this figure is based are provided in table S1C.
www.sciencemag.org SCIENCE VOL 339 29 MARCH 2013 1547
SPECIAL SECTION
CREDIT: FIG. 1A, E. COOK
Number of nonsynonymous mutations in representative human cancers, detected by genome-wide sequencing studies.
Vogelstein et al. Science 2013
Each individual tumor is different
! Data from TCGA’s analyses show that most cancer types have a great number of mutations that occur at a low frequency.
! Long-tail distribution
www.nature.com/nature
doi: 10.1038/nature08645 SUPPLEMENTARY INFORMATION
SI Guide
Supplementary Figure 1 Haploid physical coverage of breast cancer samples. Physical
coverage indicates the number of DNA fragments of which both ends have been sequenced
that on average overlie any position in the genome.
Supplementary Figure 2 Genome wide circos plots of somatic rearrangements in all 24
breast cancers in the study.
Supplementary Figure 3 NFIA-EHF, an expressed, in frame fusion gene caused by an
interchromosomal rearrangement in breast cancer cell line HCC1937. (a) Across
rearrangement PCR to confirm the presence of the somatic rearrangement in the cancer but
not in normal DNA; (b) RT-PCR of RNA between NFIA exon 2 and EHF exon 5 to confirm
the presence of a chimeric expressed transcript; (c) Representative picture of dual colour
FISH confirming a translocation in HCC1937. Red probe corresponds to BAC RP11-
364M11, chromosome 1: 61,064,196-61,228,554. Green probe corresponds to BAC RP11-
277N08, chromosome 11: 34,772,104-34,965,946. (d) Schematic diagram of the protein
domains fused in the predicted NFIA/EHF fusion protein. Domains from NFIA are blue,
domains from EHF are red. (e) Sequence from RT-PCR product shown in (b) confirming
NFIA exon 2 fused to EHF exon 5.
Supplementary Figure 4 SLC26A6-PRKAR2A, an expressed, in-frame fusion gene
generated by a tandem duplication in the breast cancer cell line HCC38. (a) Across
rearrangement PCR to confirm the presence of the somatic rearrangement in the cancer but
not in normal DNA; (b) RT-PCR of RNA between SLC26A6 exon 17 and PRKAR2A exon 4
to confirm the presence of a chimeric expressed transcript; (c) Dual colour FISH confirming
the 3p21.31 tandem duplication in HCC38. Green-labelled BAC RP11-148G20 is within the
tandem duplication. Red-labelled BAC RP11-527M10 is located ~3 Mb telomeric of the
tandem duplication. (d) Schematic diagram of the protein domains in the predicted
SLC26A6-PRKAR2A fusion protein. Domains from SLC26A6 are blue, domains from
PRKAR2A are red. (e) Sequence from RT-PCR product shown in (b) confirming SLC26A6
exon 17 fused to PRKAR2A exon 4.
Supplementary Figure 5 Prevalence of architectures of rearrangements in all 24 breast
cancers in the study: Deletion (dark blue), tandem duplication (red), inverted orientation
(green), interchromosomal (light blue), breakpoint(s) within an amplicon (orange).
Stephens et al. Nature 2009
Supervised learning Un-supervised learning
genes
samples
Analyzing gene expression data
How to deal with high dimension? Identify the most important genes
! d is the damping factor, a parameter representing the extent to which the ranking depends on the structure of the graph.
! f is the prior probability of the gene, which we set to the absolute differential expression.
! deg_i is the in-degree of i
Gene Network
Gene Expression
Somatic Alteration Data (SNP, CNV, etc.)
Ranks of Genes
▪ A ranking framework based on PageRank that considers the impact of genes in the network
▪ Impact includes connectivity and the extent to which downstream genes are differentially expressed
▪ Dynamic damping factor is used to improve the original PageRank in ranking genes
DawnRank
Personalized Driver Alterations
$$r_j^{t+1} = (1 - d_j)\, f_j + d_j \sum_{i=1}^{N} \frac{A_{ji}\, r_i^{t}}{\mathrm{deg}_i}, \qquad \mathrm{deg}_i = \sum_{j=1}^{N} A_{ji}$$
Hou and Ma, Genome Med 2014
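The update rule on this slide can be iterated to a fixed point. The following is a minimal sketch, not the DawnRank implementation: the function name `dawnrank_sketch`, the convergence tolerance, the toy three-gene network, and the constant damping value 0.85 are all illustrative assumptions. In particular, the slide does not say how the per-gene damping factor d_j is computed, so the sketch simply takes it as an input.

```python
def dawnrank_sketch(A, f, d, n_iter=100, tol=1e-9):
    """Iterate r_j <- (1 - d_j) * f_j + d_j * sum_i A[j][i] * r_i / deg_i.

    A : N x N list of lists, indexed as A[j][i] to match A_ji on the slide.
    f : prior for each gene (e.g. absolute differential expression).
    d : per-gene damping factors (assumed given; hypothetical values here).
    """
    n = len(f)
    # deg_i = sum_j A_ji, exactly as defined on the slide.
    deg = [sum(A[j][i] for j in range(n)) for i in range(n)]
    r = list(f)  # start from the priors
    for _ in range(n_iter):
        new_r = []
        for j in range(n):
            # Rank received by gene j; genes with deg_i == 0 contribute nothing.
            flow = sum(A[j][i] * r[i] / deg[i] for i in range(n) if deg[i] > 0)
            new_r.append((1 - d[j]) * f[j] + d[j] * flow)
        change = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if change < tol:
            break
    return r


# Toy network (hypothetical): gene 0 receives rank from genes 1 and 2.
A = [[0, 1, 1],
     [0, 0, 0],
     [0, 0, 0]]
f = [0.1, 0.5, 0.4]      # priors, e.g. scaled differential expression
d = [0.85, 0.85, 0.85]   # constant damping used only for illustration
ranks = dawnrank_sketch(A, f, d)
```

In this toy case gene 0 has the smallest prior but ends up with the highest rank, because it collects rank from the two differentially expressed genes it is connected to. That is the behaviour the slide describes as a gene's impact on the network.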
Tumor heterogeneity vs. gene networks
NCIS - Liu et al. BMC Bioinfo 2014 C3 - Hou et al. Bioinformatics 2016 LDGM - Tian, Gu, and Ma, Nucleic Acids Res 2016
[Figure 5 graphic. Panels A and B: degree of ESR1 and rank of ESR1, each plotted against # interactions, for LDGM, Glasso, JGL, and CNJGL. Panel C: differential network over estrogen signaling pathway genes (including ESR1), with a legend for % degree from Luminal A and % degree from Basal-like.]
Figure 5: Differential networks on estrogen signaling pathway reconstructed based on gene expression data from breast cancer Luminal A and Basal-like subtypes. (A) The degree of ESR1 in estimated differential networks with increased number of interactions. (B) The rank of ESR1 by its degree in differential networks. The number of interactions is up to 1,000 in (A) and (B). (C) A differential network $\hat{\Delta}$ estimated by LDGM with $\lambda = 0.362$. Node size is proportional to the node’s degree. Width of an interaction $i - j$ is proportional to the score $|\hat{\Delta}_{ij}|$. The origin of interactions in the differential network is inferred by a principle of majority approach based on Glasso (see Supplementary Text).
produced by C3 for small k: the IQR (interquartile range) value is roughly 0.35 for k = 5. The most likely reason behind this result is that our test weights were chosen to boost the relevance of mutual exclusivity and biological significance rather than coverage. Mutual exclusivity accounts for 100% of the negative weights of edges, while coverage accounts for only 16.7% of the positive weights. We justify this weight choice by the fact that it leads to multiple significant cluster discovery and with our assumption that coverage is a less significant driver property compared to mutual exclusivity. We also point out that it appears that a biologically more relevant coverage constraint is pathway coverage, rather than patient sample coverage.
As already mentioned in the previous sections, one advantage of C3 is that the user can adjust the weights according to her/his own belief about the significance of patient coverage. For example, by changing the averaging weights in our GBM run to w1 = 0.60 (coverage), w2 = 0.20 (network) and w3 = 0.20 (expression), we obtain a coverage percentage of 0.7903 for k = 5. However, this excellent coverage comes at a cost of a less significant mutual exclusivity score (fractional value 0.4288) and a lower proportion of detected drivers (fractional value 0.1267). As may be seen from the above example, C3 can be adapted to the user’s specification to best reflect the scope and preferences of the analysis.
Another setting in which we analyzed C3 and CoMEt involves pairwise distances of drivers in the network (see Fig. 3D). Here, we calculated the average pairwise distance between all pairs of genes clustered together. We then used Student’s t-test to determine the statistical significance of this value. We also compared the values for both algorithms based on 1000 randomly selected genes by using a permutation test. For BRCA, we found no significant performance difference between the two methods in terms of the average pairwise distance: 3.110 for C3 and 3.070 for CoMEt, with a P-value of 0.9330. In GBM, C3 showed a smaller average pairwise distance of 2.908 compared to CoMEt’s 3.097. This difference is statistically significant, with a P-value of 0.0379. The small average network distance results of C3 for GBM, coupled with the low coverage, lead to the conclusion that C3 favors niche, exclusive clusters in biologically relevant cancer pathways. Hence, the method may be useful for discovering specific molecular cancer subtypes. Both methods had an average pairwise distance well below the permutation benchmark of 3.903: the P-values of both C3 and CoMEt were less than 2 × 10^-16 for both cancers.
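The average pairwise distance criterion described above can be illustrated with a short sketch. This is not the evaluation code used by the authors: the helper names, the toy path-shaped network, and the simple empirical permutation p-value are assumptions made for illustration, and the paper additionally uses Student's t-test, which is not reproduced here.

```python
import random
from collections import deque
from itertools import combinations


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_pairwise_distance(graph, genes):
    """Mean shortest-path length over all connected pairs of the given genes."""
    lengths = []
    for a, b in combinations(genes, 2):
        d = shortest_path_lengths(graph, a).get(b)
        if d is not None:  # skip pairs in different components
            lengths.append(d)
    return sum(lengths) / len(lengths) if lengths else float("nan")


def permutation_p_value(graph, cluster, n_perm=1000, seed=0):
    """Fraction of random gene sets at least as close together as the cluster."""
    rng = random.Random(seed)
    observed = average_pairwise_distance(graph, cluster)
    nodes = sorted(graph)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(nodes, len(cluster))
        if average_pairwise_distance(graph, sample) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Toy network (hypothetical): six genes connected in a path 0-1-2-3-4-5.
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
cluster = [0, 1, 2]
observed = average_pairwise_distance(toy_graph, cluster)
p_value = permutation_p_value(toy_graph, cluster, n_perm=200)
```

A small observed distance relative to random gene sets, as reported for both methods against the 3.903 benchmark, indicates that clustered genes sit close together in the network.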
In conclusion, from our detailed evaluation we conclude that although C3 does not simultaneously outperform CoMEt with respect to all four evaluation criteria, but only three of them (which already represents a significant advantage), the C3 performance indicates a strong overall propensity to select biologically more relevant and more mutually exclusive clusters, with a higher degree of flexibility compared to CoMEt.
4.2 Discovering potential driver pathways
We examine next the potential of the C3 algorithm to detect clusters whose genes may be new candidate cancer drivers. We focus our search on clusters that contain biologically significant driver genes and known biological network interactions, and exhibit high mutual exclusivity and coverage. At the same time, we only consider the large cluster size regime, as results in this domain have not been previously reported in the literature and as they offer many new interesting insights. Two examples of our analysis are shown in Figures 4 and 5.
In BRCA, one candidate cluster with several potential novel driver genes is the cluster containing PTEN, HUWE1, CNTNAP2, GRID2, CACNA1B, CYSLTR2, MYH1 depicted in Figure 4. The genes in the candidate cluster are mutually exclusive (P-value = 0.0084). The genome landscape of this cluster is dominated primarily by mutations in PTEN and HUWE1, and secondarily by homozygous deletions in PTEN and CYSLTR2. The most
Fig. 4. A cluster of potential driver genes inferred from BRCA. (A) The alter-
ation landscape of the cluster, with blue representing mutation events, red
representing copy number deletions and green representing copy number
amplifications. (B) A known subnetwork which contains 6 genes (out of 7) in
(A). The more intense the red, the higher the alteration frequency of the gene.
Nodes highlighted in black represent driver candidates identified by C3 within
a small subnetwork. Edges are depicted in black if there exists a direct inter-
action between two genes. Green edges represent an interaction that under-
goes a protein state change. Purple edges are other interactions
Fig. 5. A cluster of potential driver genes inferred from GBM. (A) The alter-
ation landscape of the cluster, with blue representing mutation events, red
representing copy number deletions and green representing copy number
amplifications. (B) A known subnetwork which contains 6 genes (out of 10) in
(A). The more intense the red, the higher the alteration frequency of the gene.
Nodes highlighted in black represent driver candidates identified by C3 within
a small subnetwork. Edges are depicted in black if there exists a direct inter-
action between two genes. Green edges represent an interaction that under-
goes a protein state change. Purple edges are other interactions
10 J.P.Hou et al.
Deep learning applications
process (Box 1). Deep learning is now one of the most active fields in
machine learning and has been shown to improve performance
in image and speech recognition (Hinton et al, 2012; Krizhevsky et al,
2012; Graves et al, 2013; Zeiler & Fergus, 2014; Deng & Togneri, 2015),
natural language understanding (Bahdanau et al, 2014; Sutskever
et al, 2014; Lipton, 2015; Xiong et al, 2016), and most recently, in
computational biology (Eickholt & Cheng, 2013; Dahl et al, 2014;
Leung et al, 2014; Sønderby & Winther, 2014; Alipanahi et al, 2015;
Wang et al, 2015; Zhou & Troyanskaya, 2015; Kelley et al, 2016).
The potential of deep learning in high-throughput biology is
clear: in principle, it allows to better exploit the availability of
increasingly large and high-dimensional data sets (e.g. from DNA
sequencing, RNA measurements, flow cytometry or automated
microscopy) by training complex networks with multiple layers that
capture their internal structure (Fig 1C and D). The learned
networks discover high-level features, improve performance over
traditional models, increase interpretability and provide additional
understanding about the structure of the biological data.
In this review, we discuss recent and forthcoming applications of
deep learning, with a focus on applications in regulatory genomics
and biological image analysis. The goal of this review was not to
provide comprehensive background on all technical details, which
can be found in the more specialized literature (Bengio, 2012;
Bengio et al, 2013; Deng, 2014; Schmidhuber, 2015; Goodfellow
et al, 2016). Instead, we aimed to provide practical pointers and the
necessary background to get started with deep architectures, review
current software solutions and give recommendations for applying
them to data. The applications we cover are deliberately broad to
illustrate differences and commonalities between approaches;
reviews focusing on specific domains can be found elsewhere (Park
& Kellis, 2015; Gawehn et al, 2016; Leung et al, 2016; Mamoshina
et al, 2016). Finally, we discuss both the potential and possible
pitfalls of deep learning and contrast these methods to traditional
machine learning and classical statistical analysis approaches.
Deep learning for regulatory genomics
Conventional approaches for regulatory genomics relate sequence
variation to changes in molecular traits. One approach is to leverage
variation between genetically diverse individuals to map quantitative
trait loci (QTL). This principle has been applied to identify regulatory
variants that affect gene expression levels (Montgomery et al, 2010;
Pickrell et al, 2010), DNA methylation (Gibbs et al, 2010; Bell et al,
2011), histone marks (Grubert et al, 2015; Waszak et al, 2015) and
proteome variation (Vincent et al, 2010; Albert et al, 2014; Parts et al,
2014; Battle et al, 2015) (Fig 2A). Better statistical methods have
helped to increase the power to detect regulatory QTL (Kang et al,
2008; Stegle et al, 2010; Parts et al, 2011; Rakitsch & Stegle, 2016);
however, any mapping approach is intrinsically limited to variation that
is present in the training population. Thus, studying the effects of rare
mutations in particular requires data sets with very large sample size.
An alternative is to train models that use variation between
regions within a genome (Fig 2A). Splitting the sequence into
windows centred on the trait of interest gives rise to tens of thou-
sands of training examples for most molecular traits even when
using a single individual. Even with large data sets, predicting molec-
ular traits from DNA sequence is challenging due to multiple layers
[Figure 1 graphic. (A) Workflow boxes: Raw data, Pre-processing, Clean data, Feature extraction, Features, Training, Model, Evaluation, Results. (B) Supervised (x, y): linear regression, logistic regression, random forest, SVM, ...; Unsupervised (x): PCA, factor analysis, clustering, outlier detection, ... (C) Raw data versus discriminative features after feature extraction, with labels intron and exon. (D) A DNA sequence passed through Layer 1 and Layer 2 of a deep network, with outputs labelled intron, exon and TSS.]
Figure 1. Machine learning and representation learning. (A) The classical machine learning workflow can be broken down into four steps: data pre-processing, feature extraction, model learning and model evaluation. (B) Supervised machine learning methods relate input features x to an output label y, whereas unsupervised methods learn factors about x without observed labels. (C) Raw input data are often high-dimensional and related to the corresponding label in a complicated way, which is challenging for many classical machine learning algorithms (left plot). Alternatively, higher-level features extracted using a deep model may be able to better discriminate between classes (right plot). (D) Deep networks use a hierarchical structure to learn increasingly abstract feature representations from the raw data.
Molecular Systems Biology 12: 878 | 2016 © 2016 The Authors
Molecular Systems Biology Deep learning for computational biology Christof Angermueller et al
2
Published online: July 29, 2016
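The idea of splitting a genome into windows centred on positions of interest, so that a single individual yields many training examples, can be sketched as follows. The function names, the flank size, and the example positions are illustrative assumptions and are not taken from the review. Sequences are one-hot encoded because that is the usual input representation for sequence-based deep models.

```python
ALPHABET = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Bases outside the alphabet (e.g. N) become all-zero rows.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in sequence]


def centred_windows(sequence, centres, flank):
    """Return the windows of length 2 * flank + 1 around each centre.

    Centres too close to either end of the sequence are skipped.
    """
    windows = []
    for centre in centres:
        start, end = centre - flank, centre + flank + 1
        if start < 0 or end > len(sequence):
            continue
        windows.append(sequence[start:end])
    return windows


# Toy example (hypothetical positions of a measured molecular trait).
genome = "ACGTCGCGTAGT"
windows = centred_windows(genome, centres=[1, 5, 10], flank=2)
encoded = [one_hot(w) for w in windows]
```

Each encoded window would then be paired with the trait value measured at its centre to form one training example.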
From more traditional machine learning applications to deep learning applications
Angermueller et al., Mol Sys Bio 2016
DeepBind
832 VOLUME 33 NUMBER 8 AUGUST 2015 NATURE BIOTECHNOLOGY
ANALYSIS
identify cumulative effects of short motifs, and the contribution of each is determined automatically by learning. These values are fed into a nonlinear neural network with weights W, which combines the responses to produce a score (Fig. 2a, Supplementary Fig. 1 and Supplementary Notes, sec. 1.1).
We use deep learning techniques [13–16] to infer model parameters and to optimize algorithm settings. Our training pipeline (Fig. 2b) alleviates the need for hand tuning, by automatically adjusting many calibration parameters, such as the learning rate, the degree of momentum [14], the mini-batch size, the strength of parameter regularization, and the dropout probability [15].
To obtain the results reported below, we trained DeepBind models on a combined 12 terabases of sequence data, spanning thousands of public PBM, RNAcompete, ChIP-seq and HT-SELEX experiments. We provide the source code for DeepBind together with an online repository (http://tools.genes.toronto.edu/deepbind/) of 927 DeepBind models representing 538 distinct transcription factors and 194 distinct RBPs, each of which was trained on high-quality data and can be applied to score new sequences using an easily installed executable file with no hardware or software requirements.
Ascertaining DNA sequence specificities
To evaluate DeepBind’s ability to characterize DNA-binding protein specificity, we used PBM data from the revised DREAM5 TF-DNA Motif Recognition Challenge by Weirauch et al. [17]. The PBM data represent 86 different mouse transcription factors, each measured using two independent array designs. Both designs contain ~40,000 probes that cover all possible 10-mers, and all nonpalindromic 8-mers, 32 times. Participating teams were asked to train on the probe intensities using one array design and to predict the intensities of the held-out array design, which was not made available to participants.
Weirauch et al. [17] evaluated 26 algorithms that can be trained on PBM measurements, including FeatureREDUCE [17], BEEML-PBM [18], MatrixREDUCE [19], RankMotif++ [20] and Seed-and-Wobble [21]. For each individual algorithm, they optimized the data preprocessing steps to attain best test performance. Methods were evaluated using the Pearson correlation between the predicted and actual probe intensities, and values from the area under the receiver operating characteristic (ROC) curve (AUC) computed by setting high-intensity probes as positives and the remaining probes as negatives [17]. To the best of our knowledge, this is the largest independent evaluation of this type. When we tested DeepBind under the same conditions, it outperformed all 26 methods (Fig. 3a). DeepBind also ranked first among 15 teams when we submitted it to the online DREAM5 evaluation script (Supplementary Table 1).
To assess the ability of DeepBind models trained using in vitro PBM data to predict sequence specificities measured using in vivo ChIP-seq data, we followed the method described by Weirauch et al. [17]. Predicting transcription factor binding in vivo is more difficult because it is affected by other proteins, the chromatin state and the physical accessibility of the binding site. We found that DeepBind also achieves the highest score when applied to the in vivo ChIP-seq data (Fig. 3b and Supplementary Fig. 2). The best method reported in the original evaluation (Team_D, a k-mer-based model) and the best reported in the revised evaluation (FeatureREDUCE, a hybrid PWM/k-mer model) both had reasonable, but not the best, performance on in vivo data, which might be due to overfitting to PBM noise [17].
[Figure 2 graphic. (a) Current batch of inputs, then Convolve (motif detectors M), Rectify (thresholds b), Pool, Neural network (weights W), giving features and outputs; prediction errors against targets drive the Backprop and Update stages that produce parameter updates to the current model parameters. (b) 1. Calibrate: evaluate 30 random calibrations by 3-fold cross validation (train/validate splits of the training data) and use the best average validation AUC. 2. Train candidates: 3 attempts on all training data with the best calibration, compared by training AUC. 3. Test final model: use the parameters of the best candidate to predict on test data never seen during calibration or training, giving the test AUC.]
Figure 2 Details of inner workings of DeepBind and its training procedure. (a) Five independent sequences being processed in parallel by a single DeepBind model. The convolve, rectify, pool and neural network stages predict a separate score for each sequence using the current model parameters (Supplementary Notes, sec. 1). During the training phase, the backprop and update stages simultaneously update all motifs, thresholds and network weights of the model to improve prediction accuracy. (b) The calibration, training and testing procedure used throughout (Supplementary Notes, sec. 2).
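The four forward stages named in this caption (convolve, rectify, pool, neural network) can be sketched in a few lines. This is an illustrative toy, not the DeepBind code: the function names, the hand-made "TAG" motif detector, the threshold, and the use of a single linear output layer with max pooling are assumptions chosen to keep the example small. The training stages (backprop and update) are not shown.

```python
ALPHABET = "ACGT"


def one_hot(sequence):
    """One row per base, with columns ordered A, C, G, T."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in sequence]


def convolve(x, motif):
    """Slide one motif detector along the sequence and score each position."""
    m = len(motif)
    return [sum(x[p + k][c] * motif[k][c] for k in range(m) for c in range(4))
            for p in range(len(x) - m + 1)]


def rectify(scores, threshold):
    """Keep only the part of each score that exceeds the threshold."""
    return [max(0.0, s - threshold) for s in scores]


def max_pool(scores):
    """Summarise a whole sequence by its strongest motif response."""
    return max(scores)


def score_sequence(sequence, motifs, thresholds, weights, bias=0.0):
    """Convolve, rectify, pool, then combine features with a linear layer.

    Assumes the sequence is at least as long as every motif detector.
    """
    x = one_hot(sequence)
    features = [max_pool(rectify(convolve(x, motif), b))
                for motif, b in zip(motifs, thresholds)]
    return bias + sum(w * f for w, f in zip(weights, features))


# Hypothetical single motif detector that responds to the 3-mer "TAG".
motifs = [one_hot("TAG")]
thresholds = [2.0]
weights = [2.0]
bound = score_sequence("ACGTAGC", motifs, thresholds, weights)
unbound = score_sequence("ACGTTTC", motifs, thresholds, weights)
```

Here the sequence containing "TAG" receives a positive score and the one without it scores zero. In a trained model the motif detectors, thresholds and weights would be learned from data rather than set by hand.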
Figure 1 DeepBind’s input data, training procedure and applications. 1. The sequence specificities of DNA- and RNA-binding proteins can now be measured by several types of high-throughput assay, including PBM, SELEX, and ChIP- and CLIP-seq techniques. 2. DeepBind captures these binding specificities from raw sequence data by jointly discovering new sequence motifs along with rules for combining them into a predictive binding score. Graphics processing units (GPUs) are used to automatically train high-quality models, with expert tuning allowed but not required. 3. The resulting DeepBind models can then be used to identify binding sites in test sequences and to score the effects of novel mutations.
Alipanahi et al. Nat Biotech 2015 (Other methods: DeepSEA — Zhou & Troyanskaya, Nat Methods 2015;
DanQ — Quang & Xie, Nucleic Acids Res 2016)
Cancer genome
MCF-7 http://www.path.cam.ac.uk/~pawefish/
Structural variations (SVs) in cancer genomes
inversion translocation
gain loss duplication
Whole genome sequencing Methods: DELLY, Meerkat, BreakDancer, CREST, CNVnator, CONSERTING, and many others
Aneuploidy — Common feature of cancer cells
MCF-7 http://www.path.cam.ac.uk/~pawefish/
! Allele-specific copy number (ASCN) tools • ABSOLUTE, ASCAT, Patchwork
! SVs can further modify the aneuploid cancer genome into a mixture of genomic segments with an extensive range of CNAs
! We need methods that combine SV and ASCN
! How do SVs interact with ASCNs? How do different SVs interact with each other?
NATURE GENETICS VOLUME 45 | NUMBER 10 | OCTOBER 2013 1135
ANALYSIS
We then inferred the sequence of SCNA events that led to each copy number profile, using the most parsimonious set of SCNAs that could generate the observed absolute allelic copy numbers (Online Methods and Supplementary Fig. 1a). We determined the lengths, locations and numbers of copies changed for each SCNA and, in many cases, allelic structure (Supplementary Fig. 1b). We identified a total of 202,244 SCNAs, a median of 39 per cancer sample, comprising 6 categories: focal SCNAs that were shorter than the chromosome arm (median of 11 amplifications and 12 deletions per sample); arm-level SCNAs that were chromosome-arm length or longer (median of 3 amplifications and 5 deletions per sample); copy-neutral loss-of-heterozygosity (LOH) events in which one allele was deleted and the other was amplified coextensively (median of 1 per sample); and whole-genome duplications (WGDs; in 37% of cancers). By amplifications and deletions, we refer to copy number gains and losses, respectively, of any length and amplitude.
Estimated purities and ploidies per cancer varied substantially within and across lineages (Fig. 1a). Purity estimates correlated with estimates derived from measurements of leukocyte and lymphocyte contamination using DNA methylation data from the same cancers (Supplementary Fig. 1c) (H.S., L. Yao, T. Tiche Jr., T. Hinoue, C. Kandoth et al., unpublished data) but tended to indicate lower purity, consistent with the presence of non-hematopoietic contaminating normal cells. Average ploidies within lineages mirrored WGD frequencies. The average estimated ploidy within samples that had undergone a single WGD was 3.31 (not 4), suggesting that WGD events are associated with large amounts of genome loss. By contrast, samples that had not undergone WGD had an average estimated ploidy of 1.99.
Compared to the near-diploid cancers within each lineage, cancers with WGD had higher rates of every other type of SCNA (Fig. 1b) and twice the rate of SCNAs overall. Across lineages, overall SCNA rates largely reflected rates of WGD (Supplementary Fig. 1d).
In cancers with WGD, most other SCNAs occurred after WGD (Fig. 1b and Online Methods). The fractions of amplifications and deletions that were estimated to occur before WGD were highly correlated across lineages (R = 0.64; Supplementary Fig. 1e), indicating a consistent estimate for the timing of WGD with respect to other SCNAs. WGD was inferred to occur earliest relative to focal SCNAs among lineages where WGD was common (ovarian, bladder and colorectal cancers) and after most focal SCNAs in lineages in which WGD was least common (glioblastoma and kidney clear-cell carcinoma).
SCNA lengths suggest varied mechanisms of generation
Focal SCNAs for which one boundary is the telomere (telomere bounded) tended to be longer than SCNAs for which both boundaries were internal to the chromosome (median SCNA lengths for telomere-bounded and internal events respectively: amplifications, 19.6 Mb versus 0.9 Mb; deletions, 22.7 Mb versus 0.7 Mb). These differences reflect differences across the entire length distributions of telomere-bounded and internal events. Focal internal SCNAs were observed at frequencies inversely proportional to their lengths (Fig. 2a and Supplementary Fig. 2a,b), as noted previously (ref. 1). However, telomere-bounded SCNAs tended to follow a superposition of 1/length and uniform length distributions. These distributions were the same whether measuring distance by kilobase, number of array markers or number of genes, indicating that this difference in length does not result from variation in array resolution or gene density across the genome (data not shown). Focal, telomere-bounded SCNAs also accounted for more SCNAs than expected assuming random SCNA locations (12% and 26% of focal amplifications and deletions, respectively; P < 0.0001). Both telomere-bounded and internal SCNAs were more likely to end within the centromere than expected given the centromere’s length (Supplementary Fig. 2c), but differences in their length distributions remained when centromere-bounded events were excluded. Differences between telomere-bounded and internal SCNAs were even more marked for copy-neutral LOH events and displayed no correlation across lineages (Supplementary Fig. 2d).
We detected chromothripsis in 5% of samples, ranging from 0% of head and neck squamous cell carcinomas to 16% of glioblastomas (Fig. 2b and Online Methods). The rate of chromothripsis was not related to overall rates of SCNA (R = 0.13; P = 0.3). As previously reported (ref. 30), samples with chromothripsis were more likely to have chromothripsis on more than 1 chromosome (14/122 samples with chromothripsis had 2 or 3 such events; P = 0.003).
Many chromothripsis events were concentrated in a few genomic regions, often associated with known driver events (Fig. 2c). In glioblastomas,
[Figure 1 panels. (a) Purity (0 to 1.0) and ploidy (1 to 5+) per lineage (LUAD, LUSC, HNSC, KIRC, BRCA, BLCA, CRC, UCEC, GBM, OV), with the percent of samples with WGD per lineage and a sample-count summary over all lineages; legend: near diploid, 1 WGD, 2+ WGD. (b) Arm-level SCNAs per sample (0 to 16) and focal SCNAs per sample (0 to 60), for amplifications and deletions, in near-diploid and WGD samples per lineage and overall; legend: before WGD, after WGD, timing undetermined.]
Figure 1 Distribution of SCNAs across lineages. (a) Sample purity (top) and ploidy (bottom) across lineages (LUAD, lung adenocarcinoma; LUSC, lung squamous cell; HNSC, head and neck squamous cell; KIRC, kidney renal cell; BRCA, breast; BLCA, bladder; CRC, colorectal; UCEC, uterine cervix; GBM, glioblastoma multiformae; OV, ovary). Box plots show the median, first quartile and third quartile of purity in each lineage. Near-diploid samples are designated in purple; cancers that have undergone one or more than one WGD event are designated in green and red, respectively. Summary data for all lineages are indicated on the right. (b) Numbers of arm-level (top) and focal (bottom) amplifications (left) and deletions (right) across lineages. For each lineage, near-diploid samples and those with WGD events are indicated by bars on the left and right, respectively; SCNA in samples with WGD are resolved according to their timing relative to the WGD event.
Zack et al. Nature Genetics 2013
Goal — Quantify allele-specific SVs
Weaver — algorithm overview
Schema boxes: SV list, BAM file, 1KGP haplotypes, SNP list; Mappability, GC Content; Cancer Genome Graph, SNP linkage, SNP LD; Probabilistic Graphical Model (Markov Random Field); Purity, ASCNG, ASCNS, Timing of SV, Phasing.
[Figure 2 panels (A)-(F): hypothetical cancer chromosomes chrA (regions R1-R10) and chrB (regions R11-R21) with SVs m, n, p, q, s and t (del, dup, intrachr, interchr); the cancer genome graph; the MRF with cancer nodes Rm, Rn, Rp, Rq, Rs, Rt, node potentials ΘR and ΘC, and edge potentials ΨR and ΨC; supernodes such as R(3,4), R(5,6), R(7,9), R(12,14) and R(19,20); a timeline separating pre- and post-aneuploidy events; coverage from read mapping. Panel annotation: μ0 = 0; μ1 = 1; b = 10.]
Genomic regions (inputs: cov, allele_freq; outputs: CN_1, CN_2):
label  cov  allele_freq  CN_1  CN_2
R1     30   0.33         2     1
R2     40   0.5          2     2
R5     20   0            2     0
R7     10   0            1     0
R12    20   0.5          1     1
R17    10   0            0     1
SVs (inputs: L_pos, R_pos; outputs: L_allele, R_allele, CN):
label  L_pos  R_pos  L_allele  R_allele  CN
m      +R2    -R2    2         2         1
n      +R6    -R10   1         1         1
p      +R4    -R16   2         2         1
q      -R12   -R21   2         2         1
s      +R14   +R18   2         2         1
t      +R16   -R18   1         1         2
Figure 2: Illustration of the MRF model used in Weaver. (A) Hypothetical cancer chromosomes with the information of SVs and CNAs hidden. Orange and blue segments represent paternal/maternal allele. Red dashed line represents linkages by SVs. (B) The cancer genome graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference adjacencies (solid lines) or cancer adjacencies (dashed lines). (C) MRF representation. Red boxes represent cancer nodes Rc that have included SVs information; green boxes are the same with (B) and represent genome nodes R; the lines between genome nodes are genome edges Er; the lines between cancer nodes and genome nodes are cancer edges (Ec). (D) Primary inputs and outputs are illustrated for each genome node and cancer node. Inputs are presented in the form of [coverage, allele frequency], and outputs are presented with colored circles, with color showing the allele and occurrence reflecting the ASCNG. For example, the observed coverage and allele frequency are 30 and 0.33 for genome node R1 and its output shows that there are two copies of orange alleles and one copy of blue allele for genome node R1. For the cancer node Rm, the output (one blue circle) indicates that the duplication is on blue allele with one copy. (E) Blue boxes represent supernodes by merging blue shaded chains of genome nodes as shown in (C). (F) Input and output of MRF are separated into genomic regions and SVs. For region R1, the input is observed with coverage 30 and allele frequency 0.33; the output has two copies on allele 1 and one copy on allele 2. n is a post-aneuploid deletion with one copy and both breakpoints are on allele 1 of chrA. t is a pre-aneuploid deletion with two copies and both breakpoints are on allele 1 of chrB. SV m, p, q and s are from the allele that has not been duplicated.
Li et al. Cell Systems 2016
[Figure 1 panels. (A) Weaver schema: SV list, BAM file, 1KGP haplotypes, SNP list; Mappability, GC Content; Cancer Genome Graph, SNP linkage, SNP LD; Probabilistic Graphical Model (Markov Random Field); Purity, ASCNG, ASCNS, Timing of SV, Phasing. (B) Coverage track (0 to 142) over chr9:21,850,000-22,250,000 (scale bar 100 kb; genes MTAP, C9orf53, CDKN2A, CDKN2B-AS1, CDKN2B) with deletions Del1 and Del2, and the predicted order of events: LOH and first amplification, deletion, second amplification.]
Figure 1: (A) Schema diagram for Weaver. Dark green boxes show the different types of analyses, unique to Weaver, that are not dealt with by other methods, while light green ones show ‘by-products’ of Weaver shown to have an improvement over existing methods. (B) An example demonstrating a Weaver output focused on ASCNS and Timing of SV. Dark blue segments (two copies) and light blue segment (one copy) represent a portion of the MCF-7 genome that originated from the same allele on chr9. The other allele was lost during tumorigenesis, resulting in LOH. The predicted evolution of this region based on Weaver’s output is shown at the bottom: the ASCNS of Del1 is 2 and the ASCNS of Del2 is 1; both deletions occurred after the first amplification of the allele and before the second amplification.
! MRF: • genome node, cancer node, genome edge, cancer edge
Figure 5: (A) Hypothetical cancer chromosomes with rearrangement structure hidden. Orange and blue segments represent paternal/maternal allele. Red dashed lines represent linkages by SVs. (B) The Cancer Genome Graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference (solid lines) or cancer (dashed lines) adjacencies. (C) MRF representation in Weaver. Red boxes represent cancer nodes (Rc) that have included SVs information; green boxes are the same as in (B) and represent genome nodes (R); the lines between genome nodes are genome edges (Er); the lines between cancer nodes and genome nodes are cancer edges (Ec). (D) Blue boxes represent supernodes by clustering blue shaded chains of genome nodes as shown in (C). (E) Input and output of MRF are separated into genomic regions and SVs. For region R1, the input is observed coverage 30 and allele frequency 0.33; the output is 2 copies on allele 1 and 1 copy on allele 2. n is a post-aneuploid deletion with 1 copy and both breakpoints are on allele 1 of chrA. t is a pre-aneuploid deletion with 2 copies and both breakpoints are on allele 1 of chrB. SV m, p, q and s are from the allele that has not been duplicated.
Cancer Genome Graph
MRF representation
ONLINE METHODS
The overview of the Weaver algorithm is shown in Fig. 4. The input of Weaver is the BAM file of aligned and unaligned reads from a particular tumor sample. If there is a matched normal sample available, it will also be used (details in Section ). The first step is to call variants (including both SNPs and SVs) based on the BAM file. In the Weaver implementation, we designed our own SV calling procedure with details in Supplementary Note 1 (evaluation results in Table 1). For SNPs, we utilized SAMtools (version 0.1.19) [49]. In order to infer ASCNG, as well as the phasing of SNPs, we only retain SNPs from the original SNP list from SAMtools with the following criteria: (i) being heterozygous; (ii) not in segmental duplications; (iii) mappability > 0.2; and (iv) reported in the 1000 Genomes Project (1KGP). The SNP linkage calculations will be discussed in Section .
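The four SNP retention criteria can be sketched as a simple filter. This is a minimal illustration, not Weaver's C++ code: the `Snp` record, its field names and the helper functions are hypothetical, and the per-SNP annotations (segmental duplication overlap, mappability, 1KGP membership) are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    # Hypothetical record; field names are illustrative only.
    chrom: str
    pos: int
    is_heterozygous: bool
    in_segmental_duplication: bool
    mappability: float
    in_1kgp: bool


def keep_snp(snp: Snp, min_mappability: float = 0.2) -> bool:
    """Apply criteria (i)-(iv) from the text to one SNP."""
    return (
        snp.is_heterozygous                       # (i) heterozygous
        and not snp.in_segmental_duplication      # (ii) not in segmental duplications
        and snp.mappability > min_mappability     # (iii) mappability > 0.2
        and snp.in_1kgp                           # (iv) reported in 1KGP
    )


def filter_snps(snps):
    """Return only the SNPs that pass all four criteria."""
    return [s for s in snps if keep_snp(s)]
```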
Using the intermediate results (yellow boxes in Fig. 4) including the cancer genome graph construction (Section ), the Weaver MRF model will be built. By solving the MRF MAP function (Equation ), Weaver generates output as in the green boxes in Fig. 4 (see Fig. 5(E) for example).
The core modules in Weaver were written in C++. Weaver source code is freely available and can be downloaded from: http://bioen-compbio.bioen.illinois.edu/weaver/.
Genome partitioning and cancer genome graph construction
We first select a default size W (e.g., 5 kb) and partition the genome into non-overlapping regions as follows: (i) Breakpoints in SV set C must be on region boundaries; (ii) Each region may contain no more than one SNP; (iii) The size of each region must be at most W. The number of regions from the initial segmentation ranges from 1.7 million to 2 million across the datasets analyzed with Weaver, depending on the size of loss of heterozygosity (LOH) regions and the number of SVs. The rationale behind segmenting at SVs is that ASCNG boundaries coincide with SV breakpoints most of the time [14]. Segmenting at SV boundaries has the advantage of providing base-level ASCNG boundaries, in contrast to existing genome segmentation methods in copy number analysis, which typically use a fixed segment size.
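The three partitioning rules can be illustrated with a small greedy sketch for a single chromosome. This is an assumption-laden simplification, not Weaver's procedure: `partition_chromosome` is a hypothetical name, coordinates are 0-based half-open, rule (iii) is read as "at most W", and a cut is placed at every SNP after the first, which may produce more regions than strictly necessary.

```python
def partition_chromosome(chrom_len, breakpoints, snp_positions, w=5000):
    """Split [0, chrom_len) into half-open regions (start, end) such that
    (i) every SV breakpoint is a region boundary,
    (ii) no region contains more than one SNP,
    (iii) no region is longer than w."""
    cuts = {0, chrom_len}
    # Rule (i): SV breakpoints become boundaries.
    cuts.update(b for b in breakpoints if 0 < b < chrom_len)
    # Rule (ii): start a new region at each SNP after the first one,
    # so two SNPs never share a region.
    snps = sorted(set(snp_positions))
    for _, right in zip(snps, snps[1:]):
        cuts.add(right)
    # Rule (iii): chop each remaining interval into pieces of length <= w.
    bounds = sorted(cuts)
    regions = []
    for start, end in zip(bounds, bounds[1:]):
        pos = start
        while pos < end:
            nxt = min(pos + w, end)
            regions.append((pos, nxt))
            pos = nxt
    return regions
```

For example, `partition_chromosome(20000, [7000], [1000, 1200, 15000])` places boundaries at 1200, 7000 and 15000 and then subdivides the intervals that exceed 5 kb.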
Given the segmentation of the genome and SV set C, we then build the Cancer Genome Graph $G := \{R, E\}$ (Fig. 5(B)), with nodes representing genomic region sets ($R$) and edges representing reference adjacencies ($E_r$) (solid lines in the figure) if two nodes are adjacent in the normal genome and cancer adjacencies ($E_c$) (dashed lines in the figure) if two nodes are adjacent in the cancer genome by SV c linkage. Edge configurations E between nodes $R_i$ and $R_j$ can be represented as $(\sigma_i R_i \sim \sigma_j R_j)$, $\sigma \in \{+, -\}$, with $+$ and $-$ representing the tail (right) and head (left) of a given genomic region R, e.g., $(+R_i \sim -R_{i+1}) \in E_r$ if $R_i$ and $R_{i+1}$ are adjacent regions from the same chromosome in the normal genome.
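A minimal sketch of the graph construction follows, assuming regions are numbered consecutively along each chromosome and each SV is given as a pair of signed region ends. The function name and data layout are hypothetical. The usage example reads the chromosome assignment (R1-R10 on chrA, R11-R21 on chrB) and the SV list m to t off the hypothetical example in the figure.

```python
def build_cancer_genome_graph(region_chroms, svs):
    """region_chroms: dict region index -> chromosome name.
    svs: dict SV label -> ((sign, region), (sign, region)),
         with '+' the tail (right) and '-' the head (left) of a region.
    Returns (reference_edges, cancer_edges)."""
    ids = sorted(region_chroms)
    reference_edges = set()
    for left, right in zip(ids, ids[1:]):
        # (+R_i ~ -R_{i+1}) is a reference adjacency only within a chromosome.
        if right == left + 1 and region_chroms[left] == region_chroms[right]:
            reference_edges.add((("+", left), ("-", right)))
    # Each SV contributes one cancer adjacency between two region ends.
    cancer_edges = dict(svs)
    return reference_edges, cancer_edges


region_chroms = {i: "chrA" for i in range(1, 11)}
region_chroms.update({i: "chrB" for i in range(11, 22)})
svs = {
    "m": (("+", 2), ("-", 2)),
    "n": (("+", 6), ("-", 10)),
    "p": (("+", 4), ("-", 16)),
    "q": (("-", 12), ("-", 21)),
    "s": (("+", 14), ("+", 18)),
    "t": (("+", 16), ("-", 18)),
}
reference_edges, cancer_edges = build_cancer_genome_graph(region_chroms, svs)
```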
We then convert the original Cancer Genome Graph $G := \{R, E\}$ into a Markov Random Field (MRF, $M := \{R, R_c, E_r, E_c\}$), which is a widely used probabilistic graphical model to estimate joint probabilities. The MRF can be viewed as an undirected graph and the aggregated inference problem in Weaver given sequencing data can be viewed as a maximum a posteriori (MAP) problem with hidden states and observations explained in the following sections. Unlike conventional methods for estimating copy number changes based on hidden Markov models (HMMs), which are designed for sequential data and only consider the dependencies between ‘local’ variables, the MAP solution of the MRF model provides the most probable configuration of aneuploid cancer genomes with complex SVs, involving ‘global’ variable dependencies defined by long-range SVs. The detailed steps are described in Supplementary Note 6. In the following sections, we describe hidden states, observations, and the formal function of the MRF MAP problem. Details on potential functions on nodes and edges are provided in the Supplementary Note.
Hidden states H
For the ith genome node $R_i \in R \subset M$, the hidden states are $H_i = \{C^a_i, C^b_i, G^a_i, G^b_i\}$, where $C^a_i = \{C^a_{i,0}, \ldots, C^a_{i,K}\}$ and $C^b_i = \{C^b_{i,0}, \ldots, C^b_{i,K}\}$ are vectors of non-negative integers representing copy numbers (CNs) for alleles a and b of the kth population on $R_i$, respectively. When k = 0, it stands for the fraction of normal cells. Note that although the Weaver algorithm is generic and in principle can be applied to multiple subclones (K > 1), in our current implementation, Weaver only processes tumor samples without significant subclonal structure (i.e.,
K = 1). We leave the cases for K > 1 as future work. $G^a_i$ and $G^b_i$ represent the genotypes of alleles a and b of $R_i$, which are independent of subclone structure since only germline SNPs are considered. For convenience, we also set variable $C_{i,k}$ as the overall CN of the kth population on $R_i$ ($C_{i,k} = C^a_{i,k} + C^b_{i,k}$). In our analysis of cancer genomes, which typically have highly amplified regions, we do not place an upper limit on $C_{i,k}$, unlike previous CNV methods. The hidden CN is bounded by the observation of sequencing depth on each region. Note that for regions with low mappability or extreme GC content, it is not reliable to infer the hidden state space from the observed local sequencing coverage; instead, we search for the closest region and inherit its hidden state space setting, assuming that there is no dramatic state change between them.
The hidden states on cancer nodes Rc are discussed in Supplementary Method 4.
Observations O
For observation on $R \subset M$, on the ith genomic region $R_i \in R$, the observation from the hidden state is the raw read coverage $O_i$ on the entire $R_i$, which can be estimated by BEDTools [50] based on the BAM file. For a tumor sample with a matched normal genome sequenced, we calculate $O^{Norm}_i$ for the same $R_i$ and normalize $O_i$ using: $O^{new}_i = O^{Norm} \times O_i / O^{Norm}_i$, where $O^{Norm}$ is the median coverage for all regions in the normal genome.
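The normalization against the matched normal can be written in a few lines. This is a sketch only: the function name is hypothetical, coverages are assumed to be given as parallel per-region lists, and returning `None` for regions with zero normal coverage is an assumption, since the text does not say how such regions are handled.

```python
from statistics import median


def normalize_coverage(tumor_cov, normal_cov):
    """O_new_i = O_norm * O_i / O_norm_i, with O_norm the median
    coverage over all regions of the matched normal."""
    norm_median = median(normal_cov)
    return [
        norm_median * t / n if n > 0 else None  # assumption: undefined when normal coverage is 0
        for t, n in zip(tumor_cov, normal_cov)
    ]
```

A region whose normal coverage is twice the median has its tumor coverage halved, which removes region-specific bias shared by both samples.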
If $R_i$ has a SNP, $O^a_i$ and $O^b_i$ are the numbers of reads containing the SNP based on the a and b allele, respectively, which can be obtained from SNP calling pipelines such as [51]. In practice, neither sequencing nor mapping is uniform across the genome. Here we consider two widely used factors, the GC content and short read mappability. Using two HapMap samples, NA18507 and NA12878, we split the human genome into consecutive 100 bp bins and calculated the average mapping coverage on each bin. Among the bins that have unexpectedly low or high coverage compared to the rest of the genome, more than 91% have either mappability < 0.6 or GC content < 0.2 or > 0.6. Therefore, we label every $R_i$ as not read-depth informative if mappability < 0.6 or GC content < 0.2 or > 0.6. The read depth of those uninformative regions is inherited from neighboring regions.
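The labeling rule and the inheritance step can be sketched as follows. The thresholds come from the text. The choice of "nearest informative region by index, ties resolved to the left" is an assumption, since the text only says the depth is inherited from neighboring regions, and both function names are hypothetical.

```python
def is_read_depth_informative(mappability, gc_content):
    """A region is uninformative if mappability < 0.6 or GC < 0.2 or GC > 0.6."""
    return not (mappability < 0.6 or gc_content < 0.2 or gc_content > 0.6)


def inherit_read_depth(depths, informative):
    """Replace the depth of each uninformative region by that of the
    nearest informative region (assumed rule; ties go to the left)."""
    good = [i for i, ok in enumerate(informative) if ok]
    if not good:
        return list(depths)  # nothing to inherit from
    out = []
    for i, depth in enumerate(depths):
        if informative[i]:
            out.append(depth)
        else:
            nearest = min(good, key=lambda j: (abs(j - i), j))
            out.append(depths[nearest])
    return out
```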
Regarding observation on $E_r \subset M$, within two adjacent genomic regions $R_i, R_{i+1} \in R$, there are two independent observations for their genotype linkage.
(i) We assume the genotypes on i and i + 1 are $G^a_i / G^b_i$ and $G^a_{i+1} / G^b_{i+1}$, respectively. We define the Linkage Disequilibrium (LD) score for the phasing configuration $G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}$ as:
$$LD(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) = \frac{N_{LD}(G^a_i, G^a_{i+1}) \times N_{LD}(G^b_i, G^b_{i+1})}{N_{LD}(G^a_i, G^a_{i+1}) \times N_{LD}(G^b_i, G^b_{i+1}) + N_{LD}(G^a_i, G^b_{i+1}) \times N_{LD}(G^b_i, G^a_{i+1})}$$
where $N_{LD}(G^a_i, G^a_{i+1})$ is the number of phased haplotypes (total number 1092 × 2 in phase 1) in 1KGP with genotype $(G^a_i, G^a_{i+1})$. Other genotype configurations can be similarly calculated.
(ii) Similarly, we define the read linkage score for the phasing $G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}$ as:
$$RL(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) = \frac{N_{RL}(G^a_i, G^a_{i+1}) + N_{RL}(G^b_i, G^b_{i+1})}{N_{RL}(R_i, R_{i+1})}$$
where $N_{RL}(R_i, R_{i+1})$ is the total number of reads covering genomic regions $(R_i, R_{i+1})$ and $N_{RL}(G^a_i, G^a_{i+1})$ is the total number of reads covering $(G^a_i, G^a_{i+1})$. If there are no reads covering $(R_i, R_{i+1})$ ($N_{RL}(R_i, R_{i+1}) = 0$), RL = 0.
Therefore, we define genotype linkage as
$$GL(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) = \log\left(LD(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1}) * RL(G^a_i, G^a_{i+1} / G^b_i, G^b_{i+1})\right)$$
In real data applications, we have found that RL and LD correlate very well. For example, in the MCF-7 analysis, when we chose SNP pairs with 100% RL support as the gold standard, we found AUC = 0.9964 using LD scores.
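The three linkage scores can be computed directly from counts. The sketch below is illustrative: the function names and the haplotype counts in the usage example are made up, returning 0.5 when no 1KGP haplotype supports either phasing is an assumption (the text does not define that case), and the log is taken as negative infinity when the product is zero.

```python
import math
from collections import Counter


def ld_score(hap_counts, ga_i, ga_j, gb_i, gb_j):
    """LD score of phasing (ga_i, ga_j) / (gb_i, gb_j) from phased haplotype counts."""
    cis = hap_counts[(ga_i, ga_j)] * hap_counts[(gb_i, gb_j)]
    trans = hap_counts[(ga_i, gb_j)] * hap_counts[(gb_i, ga_j)]
    total = cis + trans
    return cis / total if total else 0.5  # assumption: uninformative when no support


def rl_score(n_reads_a, n_reads_b, n_reads_total):
    """Fraction of reads spanning both regions that support the phasing; 0 if none span."""
    return (n_reads_a + n_reads_b) / n_reads_total if n_reads_total else 0.0


def genotype_linkage(ld, rl):
    """GL = log(LD * RL); -inf when the product is zero."""
    product = ld * rl
    return math.log(product) if product > 0 else float("-inf")


# Made-up counts over 1092 x 2 = 2184 phased haplotypes.
counts = Counter({("A", "C"): 1200, ("G", "T"): 900, ("A", "T"): 50, ("G", "C"): 34})
ld = ld_score(counts, "A", "C", "G", "T")
gl = genotype_linkage(ld, rl_score(9, 8, 20))
```

By construction the LD scores of the two complementary phasings sum to one, so the score behaves like a probability for the chosen phasing.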
Markov random field model M
After we convert G into MRF M using steps in Supplementary Note 6, the MRF MAP problem is given by:
$$\hat{H} = \arg\max_{H} \left\{ \sum_{i \in R} \Theta_R(O \mid H_i) + \sum_{c \in C} \Theta_C(O \mid H_c) + \sum_{i \in R} \Psi_R(O \mid H_i, H_{i+1}) + \sum_{c \in C} \sum_{i \in N(c)} \Psi_C(H_i, H_c) \right\}$$
The four terms are, in order: the genome node potential function, the cancer node potential function, the genome edge potential function, and the cancer edge potential function.
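To make the structure of the MAP objective concrete, here is a toy brute-force sketch over total copy numbers. Everything about the potentials is an assumption made for illustration: Weaver's actual potentials are defined in its Supplementary Note and are not reproduced here, the model below ignores alleles, purity and phasing, and exhaustive search only works for a handful of nodes. Node potentials are Poisson log-likelihoods of read counts, the genome edge potential penalizes copy-number changes across boundaries not explained by an SV, and the cancer edge potential penalizes a mismatch between the copy-number step at a breakpoint and the SV copy number.

```python
import itertools
import math


def poisson_loglik(k, mean):
    mean = max(mean, 1e-3)  # keep log() finite when copy number is 0
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def map_estimate(coverage, sv_support, sv_boundaries, depth_per_copy=10.0,
                 max_cn=4, change_penalty=2.0, sv_penalty=5.0):
    """Exhaustive MAP over (region copy numbers, SV copy numbers).
    sv_boundaries[c] lists the region pairs (i, j) whose boundary SV c explains."""
    n = len(coverage)
    explained = {b for bounds in sv_boundaries for b in bounds}
    states = range(max_cn + 1)
    best_score, best_state = -math.inf, None
    for cn in itertools.product(states, repeat=n):
        for sv_cn in itertools.product(states, repeat=len(sv_support)):
            # Genome node potentials: coverage given region copy number.
            score = sum(poisson_loglik(o, c * depth_per_copy) for o, c in zip(coverage, cn))
            # Cancer node potentials: supporting reads given SV copy number.
            score += sum(poisson_loglik(o, s * depth_per_copy) for o, s in zip(sv_support, sv_cn))
            # Genome edge potentials: smoothness across unexplained boundaries.
            for i in range(n - 1):
                if (i, i + 1) not in explained:
                    score -= change_penalty * abs(cn[i] - cn[i + 1])
            # Cancer edge potentials: copy-number step should equal SV copy number.
            for s, bounds in zip(sv_cn, sv_boundaries):
                for i, j in bounds:
                    score -= sv_penalty * abs(abs(cn[i] - cn[j]) - s)
            if score > best_score:
                best_score, best_state = score, (cn, sv_cn)
    return best_state


# Four regions; one duplication spanning regions 1-2 with 9 supporting reads.
best = map_estimate([20, 31, 36, 20], [9], [[(0, 1), (2, 3)]])
```

In this toy input, region 2 on its own slightly prefers copy number 4 (coverage 36), but the edge terms pull the joint solution to copy numbers (2, 3, 3, 2) with SV copy number 1, which is the kind of coupling between SVs and copy numbers that the MRF is meant to capture.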
ASCN and SVs in MCF-7
! 83% of SVs have copy number > 1
! 68% of the regions have imbalanced copy number
! We found 276 SVs after whole chromosome duplication
! We have used physical mapping to validate the results
ASCN and SVs in HeLa
! WGS reads obtained from Adey et al. Nature 2013
! ASCNG are 97% consistent with Adey et al. (Fosmid seq)
Structural variants were identified by clustering discordantly mapped reads from 40-kb and 3-kb mate-pair libraries (Supplementary Fig. 8). Twenty interchromosomal links were identified, including links for marker chromosomes M11 (9q33–11p14) and M14 (13q21–19p13). In addition, 209 HeLa-specific deletions and 8 inversions were found (Supplementary Figs 9 and 11, and Supplementary Table 10). Only two genes that are impacted by HeLa-specific structural rearrangements (Supplementary Table 11) intersected with SCGC (STK11 (ref. 18), FHIT), both of which are recurrently deleted in cervical carcinomas (refs 18, 19).
Conventional whole-genome sequencing fails to resolve haplotype phase, an essential aspect of the description and interpretation of non-haploid genomes, including cancer genomes (ref. 20). Recently, several groups have demonstrated genome-wide measurement of local (ref. 5) or sparse (ref. 21) haplotypes, but these approaches have yet to be applied to aneuploid cancer genomes. To resolve haplotype phase across the HeLa genome, we sequenced pools of fosmid clones (ref. 5). Specifically, we constructed three complex fosmid-clone libraries, and then carried out limiting dilution and shotgun sequencing of 288 fosmid clone pools. In summary, these were estimated to include 518,293 individual non-overlapping clones with a median insert size of 33 kb, for a total physical coverage of 6.3× of the haploid reference genome (Supplementary Fig. 12). The complement of likely inherited heterozygous variants (SNP and indel, n = 1.97 × 10^6) was ascertained by shotgun sequencing and by cross-referencing with calls made by the 1000 Genomes Project, and then re-genotyped using reads from each clone pool. Alleles that were present at distinct heterozygous sites within a given clone were assigned, or ‘phased’, to the same inherited haplotype, and the unobserved alleles were implicitly phased to the opposite haplotype. When overlapping clones from distinct pools were merged, this resulted in haplotype blocks with an N50 (the contig size above which 50% of the total length of the haplotype assembly is included) of 550 kb containing 90.6% of heterozygous variants that were probably inherited.
Most of the HeLa genome is present at an uneven haplotype ratio (for example, 2:1 in regions in which copy number = 3). We sought to exploit the resulting allelic imbalance to phase consecutive haplotype blocks (Supplementary Fig. 13). We first calculated the cumulative allelic ratio among shotgun reads for the SNVs residing in each haplotype block, which clustered closely with the underlying haplotype ratio. For example, in non-LOH regions with a copy number of 3 that have ratios of 2:1 or 1:2, allelic ratios calculated for each block had distributions centred on 0.32 or 0.65, close to the expected fractions of one-third and two-thirds (Supplementary Fig. 14). Using these ratios, we merged haplotype blocks into scaffolds covering 1.96 Gb or 90.3% of the non-LOH HeLa genome (scaffold N50 of 44.8 megabases (Mb); Supplementary Table 12). The haplotype-resolved scaffolds were then merged with the copy-number map to produce a global, haplotype-resolved copy-number profile of the aneuploid HeLa genome (Fig. 1a, Supplementary Fig. 15 and Supplementary Table 13).
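The expected allelic fractions quoted above follow from the haplotype copy numbers alone. A minimal sketch, assuming a pure sample (reasonable for a cell line) and unbiased read sampling; the function name is hypothetical:

```python
from fractions import Fraction


def expected_allelic_fraction(cn_a, cn_b):
    """Expected fraction of reads carrying the haplotype-A allele
    when haplotype A has cn_a copies and haplotype B has cn_b copies."""
    total = cn_a + cn_b
    if total == 0:
        raise ValueError("region has no copies of either haplotype")
    return Fraction(cn_a, total)


# Copy number 3 with haplotype ratios 1:2 and 2:1.
low = expected_allelic_fraction(1, 2)   # one-third
high = expected_allelic_fraction(2, 1)  # two-thirds
```

The observed block-level ratios of 0.32 and 0.65 sit close to these values, which is what allows blocks to be assigned to the more or less abundant haplotype.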
Phasing accuracy was independently confirmed by several methods. First, 99.7% of informative read pairs from 3-kb mate-pair sequencing
[Figure 1 panels. (a) Copy-number profile of HeLa by haplotype (haplotype A, haplotype B) across chromosomes 1-22 and X, with links for tandem duplications and probable contiguity, marker chromosomes labelled by linked position and name (for example M11 at 9q33 and 11p14; M14 at 13q21 and 19p13), the HPV integration locus, and LOH regions. (b) Window ratio and copy number against genomic position: S3 window ratios, CCL-2 window ratios, S3 copy-number calls, S3-specific differences.]
Figure 1 | Haplotype-resolved copy number of the HeLa cancer cell line genome. a, Copy-number profile of HeLa split by haplotypes. Links denote likely contiguity and tandem duplications. Boxes indicate marker chromosomes identified by copy-number breakpoints (boxes are coloured by haplotype; black, unknown; pink text, uncertain locations; S, links confirmed by mate-pair sequencing). b, Windowed copy-number ratios for HeLa CCL-2 (green and purple, alternating chromosomes) and HeLa S3 (grey), with predicted integer copy number for S3 (black). Notable strain differences are indicated by red arrows (for example, reduced copy over chromosome 18q). The window containing the HPV insertion and rearrangement is at elevated copy in both strains.
RESEARCH LETTER
208 | NATURE | VOL 500 | 8 AUGUST 2013
Adey et al. Nature, 2013
Application to TCGA Data
! Inter-chromosomal chromothripsis
Supplementary Figure 14: (A) Overview of the genomic landscape of a TCGA ovarian cancer sample (TCGA-36-1571). (B) Most of SVs linking chr4 and chr22 have copy number one. (C) Three high-coverage fold-back inversions are observed at boundaries of highly amplified region of chr14, indicating many rounds of BFB cycles happened. Gene FOXG1 is on the fold-back inversion boundary and highly amplified.
! Breakage-fusion-bridge amplifications
TCGA-36-1571