Machine Learning Applications in Computational Genomics — Some new algorithms for understanding cancer genomes Jian Ma Computational Biology Department School of Computer Science

The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU


Page 1: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Machine Learning Applications in Computational Genomics

— Some new algorithms for understanding cancer genomes

Jian Ma

Computational Biology Department School of Computer Science

Page 2: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

[Slide background: genomic DNA sequence text]

Fundamental question: How do changes in genome sequences give rise to phenotypic differences (e.g., disease states)?

• When the changes entered the genome and how they have evolved

• Their roles in genome organization and gene regulation for human biology

• Their implications in human diseases such as cancer

Our goal — from base-pairs to bedside

Page 3: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Why Computational Genomics?


Page 4: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Why Computational Genomics?

• Key to personalized precision medicine, especially for cancer

David Patterson

• Cancer research has become big data science
• How to store and manage data efficiently
• How to analyze data in a distributed environment
• How to enhance data security but reduce barriers for sharing
• How to extract meaningful patterns
• How to identify mechanisms to help treatment
• …

Page 5: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

The Human genome: the “blueprint” of our body

[Slide graphic: genomic DNA sequence text]

James Watson, Francis Crick

February 15, 2001

March, 2011

Page 6: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

DNA, Chromosome, and Genome

Chapter 4: DNA, Chromosomes, and Genomes (p. 244)

Figure 4-72 Chromatin packing. This model shows some of the many levels of chromatin packing postulated to give rise to the highly condensed mitotic chromosome. Levels shown, with approximate widths: "beads-on-a-string" form of chromatin (11 nm); 30-nm chromatin fiber of packed nucleosomes (30 nm); section of chromosome in extended form (300 nm); condensed section of chromosome (700 nm); entire mitotic chromosome (1400 nm). Net result: each DNA molecule has been packaged into a mitotic chromosome that is 10,000-fold shorter than its extended length.

Figure 4-73 The SMC proteins in condensins. (A) Electron micrographs of a purified SMC dimer. (B) The structure of an SMC dimer. The long central region of this protein is an antiparallel coiled-coil (see Figure 3-9) with a flexible hinge in its middle. (C) A model for the way in which the SMC proteins in condensins might compact chromatin. In reality, SMC proteins are components of a much larger condensin complex. It has been proposed that, in the cell, condensins coil long strings of looped chromatin domains (see Figure 4-57). In this way, the condensins could form a structural framework that maintains the DNA in a highly organized state during metaphase of the cell cycle. (A, courtesy of H.P. Erickson; B and C, adapted from T. Hirano, Nat. Rev. Mol. Cell Biol. 7:311-322, 2006. With permission from Macmillan Publishers Ltd.)

Chromosomal DNA and Its Packaging in the Chromatin Fiber (p. 203)

…along each mitotic chromosome (Figure 4-11). The structural bases for these banding patterns are not well understood. Nevertheless, the pattern of bands on each type of chromosome is unique, and it is these patterns that initially allowed each human chromosome to be identified and numbered.

The display of the 46 human chromosomes at mitosis is called the human karyotype. If parts of chromosomes are lost or are switched between chromosomes, these changes can be detected by changes in the banding patterns or by changes in the pattern of chromosome painting (Figure 4-12). Cytogeneticists use these alterations to detect chromosome abnormalities that are associated with inherited defects, as well as to characterize cancers that are associated with specific chromosome rearrangements in somatic cells (discussed in Chapter 20).

Figure 4-10 The complete set of human chromosomes. These chromosomes, from a male, were isolated from a cell undergoing nuclear division (mitosis) and are therefore highly compacted. Each chromosome has been "painted" a different color to permit its unambiguous identification under the light microscope. Chromosome painting is performed by exposing the chromosomes to a collection of human DNA molecules that have been coupled to a combination of fluorescent dyes. For example, DNA molecules derived from chromosome 1 are labeled with one specific dye combination, those from chromosome 2 with another, and so on. Because the labeled DNA can form base pairs, or hybridize, only to the chromosome from which it was derived (discussed in Chapter 8), each chromosome is differently labeled. For such experiments, the chromosomes are subjected to treatments that separate the double-helical DNA into individual strands, designed to permit base-pairing with the single-stranded labeled DNA while keeping the chromosome structure relatively intact. (A) The chromosomes visualized as they originally spilled from the lysed cell. (B) The same chromosomes artificially lined up in their numerical order. This arrangement of the full chromosome set is called a karyotype. (From E. Schrock et al., Science 273:494-497, 1996. With permission from AAAS.)

Figure 4-11 The banding patterns of human chromosomes. Chromosomes 1-22 are numbered in approximate order of size. A typical human somatic (non-germ-line) cell contains two of each of these chromosomes, plus two sex chromosomes: two X chromosomes in a female, one X and one Y chromosome in a male. The chromosomes used to make these maps were stained at an early stage in mitosis, when the chromosomes are incompletely compacted. The horizontal red line represents the position of the centromere (see Figure 4-21), which appears as a constriction on mitotic chromosomes. The red knobs on chromosomes 13, 14, 15, 21, and 22 indicate the positions of genes that code for the large ribosomal RNAs (discussed in Chapter 6). These patterns are obtained by staining chromosomes with Giemsa stain, and they can be observed under the light microscope. (For micrographs, see Figure 21-18; adapted from U. Franke, Cytogenet. Cell Genet. 31:24-32, 1981. With …

Chapter 4: DNA, Chromosomes, and Genomes (p. 198)

…4-4). This complementary base-pairing enables the base pairs to be packed in the energetically most favorable arrangement in the interior of the double helix. In this arrangement, each base pair is of similar width, thus holding the sugar-phosphate backbones an equal distance apart along the DNA molecule. To maximize the efficiency of base-pair packing, the two sugar-phosphate backbones …

[Figure panel labels: building blocks of DNA (sugar, phosphate, base); double-stranded DNA; DNA double helix; hydrogen-bonded base pairs]

Figure 4-3 DNA and its building blocks. DNA is made of four types of nucleotides, which are linked covalently into a polynucleotide chain (a DNA strand) with a sugar-phosphate backbone from which the bases (A, C, G, and T) extend. A DNA molecule is composed of two DNA strands held together by hydrogen bonds between the paired bases. The arrowheads at the ends of the DNA strands indicate the polarities of the two strands, which run antiparallel to each other in the DNA molecule. In the diagram at the bottom left of the figure, the DNA molecule is shown straightened out; in reality, it is twisted into a double helix, as shown on the right. For details, see Figure 4-5.

Figure 4-4 Complementary base pairs in the DNA double helix. The shapes and chemical structure of the bases allow hydrogen bonds to form efficiently only between A and T and between G and C, where atoms that are able to form hydrogen bonds (see Panel 2-3, pp. 110-111) can be brought close together without distorting the double helix. As indicated, two hydrogen bonds form between A and T, while three form between G and C. The bases can pair in this way only if the two polynucleotide chains that contain them are antiparallel to each other.

[Figure 4-4 structure labels: adenine, thymine, guanine, cytosine, hydrogen bond, sugar-phosphate backbone, 3' and 5' strand ends]

Page 7: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

DNA, RNA, Protein

• Central dogma of molecular biology: DNA → RNA → Protein

• In general, proteins do most of the work, and are encoded by subsequences of DNA known as genes.

• However, less than 2% of the human genome codes for proteins.
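As a minimal sketch of the DNA → RNA → protein flow on this slide, the Python snippet below transcribes a coding-strand fragment and translates it with the standard genetic code. The function names `transcribe` and `translate` are illustrative choices, not part of the talk; the 45-nucleotide fragment is the first row of the DLX1 alignment shown on a later slide, and treating it as an in-frame coding-strand sequence is an assumption made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative fragment (assumed to be in frame); see the lead-in above.
dna = "AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAC"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)
print(protein)
```

The translated peptide can be compared with the amino-acid row printed above the DLX1 alignment on the conservation slide.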


Page 8: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Most of the genome is non-coding

© 2005 Nature Publishing Group

[Pie chart: components of the human genome]

LINEs: 20.4%
SINEs: 13.1%
LTR retrotransposons: 8.3%
DNA transposons: 2.9%
Simple sequence repeats: 3%
Segmental duplications: 5%
Miscellaneous heterochromatin: 8%
Miscellaneous unique sequences: 11.6%
Introns: 25.9%
Protein-coding genes: 1.5%

At least 40 different transposable-element families are represented by young, recently active elements in the pufferfish Takifugu rubripes (formerly known as Fugu rubripes), despite its genome being among the smallest in vertebrates. But even the most common type, the LINE element Maui, is present in only 6,400 copies32. In the second pufferfish to be sequenced, Tetraodon nigroviridis, only 4,000 transposable-element copies are found in total, but this still represents 73 different types of element33. Genomes that have a higher proportion of DNA transposons, such as in Drosophila melanogaster and Arabidopsis thaliana, contain elements of more recent origin that are derived from more families than in mammals16; this is explained by the fact that DNA transposons tend to be more short-lived and to spread by horizontal transfer. The genome of D. melanogaster, for example, contains about 130 different transposable-element families (including 25 non-LTR and 28 LTR families), all of which are younger than 20 million years (Myr) REF. 34.

There is therefore convincing evidence that many smaller genomes contain a surprisingly high diversity of transposable-element families. It also now seems that the diversity of lineages within individual transposable-element families might be higher in some smaller genomes. In mammals, the abundant LINE1 elements tend to be represented by a single lineage,

Box 3 | The main components of eukaryotic genomes

Protein-coding genes
Although most prokaryotic chromosomes consist almost entirely of protein-coding genes86, such elements make up a small fraction of most eukaryotic genomes (see figure). As a prime example, the human genome might contain as few as 20,000 genes, comprising less than 1.5% of the total genome sequence16,82.

Introns
Shortly after their discovery, the non-coding intervening sequences within coding genes (introns) were suggested to account for the pronounced discrepancy between gene number and genome size7. It has also recently been suggested that most non-coding DNA in animals (but not plants) is intronic, which would imply that most of the genome is transcribed even though protein-coding regions represent a tiny minority107,108. At the very least, introns were found to account for more than a quarter of the draft human sequence16. Over a broad taxonomic scale, intron size and genome size are positively correlated109, although within genera a correlation might (for example, Drosophila110) or might not (for example, Gossypium111) be observed.

Pseudogenes
Non-functional copies of coding genes, the original meaning of the term ‘junk DNA’, were once thought to explain variation in genome size4. However, it is now apparent that even in combination, ‘classical pseudogenes’ (direct DNA to DNA duplicates), ‘processed pseudogenes’ (copies that are reverse transcribed back into the genome from RNA and therefore lack introns) and ‘Numts’ (nuclear pseudogenes of mitochondrial origin) comprise a relatively small portion of mammalian genomes. The human genome is estimated to contain about 19,000 pseudogenes46.

Transposable elements
In eukaryotes, transposable elements are divided into two general classes according to their mode of transposition. Class I elements transpose through an RNA intermediate. This class comprises long interspersed nuclear elements (LINEs), endogenous retroviruses, short interspersed nuclear elements (SINEs) and long terminal repeat (LTR) retrotransposons. Class II elements transpose directly from DNA to DNA, and include DNA transposons and miniature inverted repeat transposable elements (MITEs).

Transposable elements (and especially their extinct remnants) make up a large portion of the human genome, with some elements (for example, the SINE Alu element) present in more than a million copies. Transposable-element evolution involves complex interactions with the host genome and other subgenomic elements, ranging from parasitism to mutualism. For a review of transposable-element structure, origins, impacts and evolution see REF. 17.

The figure provides a summary of the different components of the human genome. Less than 1.5% of the genome consists of the suspected 20,000–25,000 protein-coding sequences. By contrast, a large majority is made up of non-coding sequences such as introns (almost 26%) and (mostly defunct) transposable elements (nearly 45%). Data are taken from REF. 16.
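The "nearly 45%" figure for transposable elements can be checked with a few lines of arithmetic. The sketch below is illustrative only: the pairing of each pie-chart label with its percentage is an assumption reconstructed from the flattened figure labels (the extracted text lists labels and values separately), and the variable names are made up for this example.

```python
# Percentages from the pie chart; the label-to-value pairing is an assumption.
genome_components = {
    "LINEs": 20.4,
    "SINEs": 13.1,
    "LTR retrotransposons": 8.3,
    "DNA transposons": 2.9,
    "Simple sequence repeats": 3.0,
    "Segmental duplications": 5.0,
    "Miscellaneous heterochromatin": 8.0,
    "Miscellaneous unique sequences": 11.6,
    "Introns": 25.9,
    "Protein-coding genes": 1.5,
}

# The four transposable-element classes named in Box 3.
transposable_classes = ["LINEs", "SINEs", "LTR retrotransposons", "DNA transposons"]

te_total = round(sum(genome_components[name] for name in transposable_classes), 1)
non_coding = round(100.0 - genome_components["Protein-coding genes"], 1)
chart_total = round(sum(genome_components.values()), 1)

print(f"Transposable elements: {te_total}%")
print(f"Not protein-coding:    {non_coding}%")
print(f"Sum of chart slices:   {chart_total}%")
```

Under this pairing the four transposable-element slices sum to 44.7%, consistent with the "nearly 45%" stated in the box, and the slices together cover slightly less than 100% because of rounding in the published values.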

702 | SEPTEMBER 2005 | VOLUME 6 www.nature.com/reviews/genetics


Nat Rev Genet, 2005

Page 9: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Most functional information is non-coding

• About 5% of the genome is highly conserved, but only 1.5% encodes proteins

[Figure: UCSC Genome Browser view of chr2 (q31.1), positions 172,660,000-172,675,000, showing DLX1 and DLX2 in the "UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics" track, together with the "Vertebrate Multiz Alignment & PhastCons Conservation (28 Species)" track. Species rows: Human, Chimp, Rhesus, Bushbaby, Tree_shrew, Mouse, Rat, Guinea_Pig, Shrew, Hedgehog, Dog, Cat, Horse, Cow, Armadillo, Elephant, Tenrec, Opossum, Platypus, Lizard, Chicken, Zebrafish, Tetraodon, Fugu, Stickleback, Medaka.]

[Zoomed view of a DLX1 coding segment: amino acids K P R T I Y S S L Q L Q A L N above the per-species nucleotide alignment; first row: AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAC]

What do they do?
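To make the notion of "highly conserved" concrete, here is a rough sketch that scores each alignment column by the fraction of rows sharing the most common base. This is a deliberately naive identity score and not the PhastCons method named in the browser track; the function name `column_conservation` is hypothetical, and the three rows are taken from the DLX1 alignment on this slide without assuming which species each row belongs to.

```python
from collections import Counter


def column_conservation(rows):
    """Fraction of rows that share the most common base, for each column."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    scores = []
    for column in zip(*rows):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(rows))
    return scores


# Three rows from the DLX1 alignment (species labels not assumed).
rows = [
    "AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAC",
    "AAACCCAGGACGATTTATTCCAGCTTGCAGTTGCAGGCTTTGAAC",
    "AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAT",
]

scores = column_conservation(rows)
variable_columns = [i for i, score in enumerate(scores) if score < 1.0]
print("variable columns (0-based):", variable_columns)
```

In this small sample both variable columns fall on the third base of a codon, the position where substitutions most often leave the encoded amino acid unchanged.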

Page 10: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Annotating the non-coding regions

[Figure: UCSC Genome Browser view (hg19) of chr2:20,090,000-20,105,000, 10 kb scale, around TTC32. Tracks: NKI LADs (Tig3); LaminB1 (Tig3); GM78 CHD2 IgM; GM78 Pol2 IgM; GM78 Pol2 Std; GM78 Rad2 IgR; GM78 TBP IgM; GM78 Z274 Std; K562 CHD2 IgR; K562 Pol2 IgM; K562 IFa3 Pol2 Sd; K562 IFa6 Pol2 Sd; K562 IFg3 Pol2 Sd; K562 IFg6 Pol2 Sd; K562 Pol2 Std; K562 Rad2 Std; K562 TBP IgM; K562 Z274 UCD.]

NATURE METHODS | VOL.9 NO.3 | MARCH 2012 | 215

CORRESPONDENCE

ChromHMM: automating chromatin-state discovery and characterization

To the Editor: Chromatin-state annotation using combinations of chromatin modification patterns has emerged as a powerful approach for discovering regulatory regions and their cell type–specific activity patterns and for interpreting disease-association studies1–5. However, the computational challenge of learning chromatin-state models from large numbers of chromatin modification datasets in multiple cell types still requires extensive bioinformatics expertise. To address this challenge, we developed ChromHMM, an automated computational system for learning chromatin states, characterizing their biological functions and correlations with large-scale functional datasets and visualizing the resulting genome-wide maps of chromatin-state annotations.

ChromHMM is based on a multivariate hidden Markov model that models the observed combination of chromatin marks using a product of independent Bernoulli random variables2, which enables robust learning of complex patterns of many chromatin modifications. As input, it receives a list of aligned reads for each chromatin mark, which are automatically converted into presence or absence calls for each mark across the genome, based on a Poisson background distribution. One can use an optional additional input of aligned reads for a control dataset to either adjust the threshold for present or absent calls, or as an additional input mark. Alternatively, the user can input files that contain calls from an independent peak caller. By default, chromatin states are analyzed at 200-base-pair intervals that roughly approximate nucleosome sizes, but smaller or larger windows can be specified. We also developed an improved parameter-initialization procedure that enables relatively efficient inference of comparable models across different numbers of states (Supplementary Note).

ChromHMM outputs both the learned chromatin-state model parameters and the chromatin-state assignments for each genomic position. The learned emission and transition parameters are returned in both text and image format (Fig. 1), automatically grouping chromatin states with similar emission parameters or proximal genomic locations, although a user-specified reordering can also be used (Supplementary Figs. 1–2 and Supplementary Note). ChromHMM enables the study of the likely biological roles of each chromatin state based on enrichment in diverse external annotations and experimental data, shown as heat maps and tables (Fig. 1), both for direct genomic overlap and at various distances from a chromatin state (Supplementary Fig. 3). ChromHMM also generates custom genome browser tracks6 that show the resulting chromatin-state segmentation in dense view (single color-coded track) or expanded view (each state shown separately) (Fig. 1). All the files ChromHMM produces by default are summarized on a webpage (Supplementary Data).

ChromHMM also enables the analysis of chromatin states across multiple cell types. When the chromatin marks are common across the cell types, a common model can be learned by a virtual ‘concatenation’ of the chromosomes of all cell types. Alternatively a model can be learned by a virtual ‘stacking’ of all marks across cell types, or independent models can be learned in each cell type. Lastly, ChromHMM supports the comparison of models with different number of chromatin states based on correlations in their emission parameters (Supplementary Fig. 4).

We wrote the software in Java, which allows it to be run on virtually any computer. ChromHMM and additional documentation is freely available at http://compbio.mit.edu/ChromHMM/.
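The binarization step described in the correspondence, converting read counts per interval into presence or absence calls against a Poisson background, can be sketched roughly as follows. This is not ChromHMM's code: the function names, the tail-probability cutoff of 1e-4, and the toy read counts are all assumptions chosen for illustration, and the background rate is simply taken as the mean count per bin.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def binarize(counts, p_threshold=1e-4):
    """Call a mark present in bins whose read count is unlikely under
    a Poisson background with rate equal to the mean count per bin.

    Returns (calls, cutoff), where cutoff is the smallest count t with
    P(X >= t) < p_threshold. The threshold value is an assumption.
    """
    lam = sum(counts) / len(counts)
    cutoff = 0
    while poisson_tail(cutoff, lam) >= p_threshold:
        cutoff += 1
    calls = [1 if count >= cutoff else 0 for count in counts]
    return calls, cutoff


# Hypothetical read counts for one mark in ten consecutive 200-bp bins.
counts = [0, 1, 2, 1, 0, 1, 30, 2, 1, 0]
calls, cutoff = binarize(counts)
print("cutoff:", cutoff)
print("calls: ", calls)
```

Only the bin with 30 reads is called present here; the surrounding low counts are consistent with background.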

Figure 1 | Sample outputs of ChromHMM. (a) Example of chromatin-state annotation tracks produced from ChromHMM and visualized in the UCSC genome browser6, including dense view (top; single track) and expanded view (bottom; separate tracks). (b,c) Heat maps for model parameters (b) and for chromatin-state functional enrichments (c). The columns indicate the relative percentage of the genome represented by each chromatin state and relative fold enrichment for several types of annotation. CTCF, CCCTC-binding factor; WCE, whole-cell extract; TSS, transcription start site; TES, transcript end site; GM12878 is a lymphoblastoid cell line. Data in this example correspond to a previous model learned across nine cell types3.
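The model in the correspondence, a hidden Markov model whose emissions are products of independent Bernoulli variables over binary mark calls, can be illustrated with a toy two-state example. Everything below is a sketch under stated assumptions: the two states, the two marks, and all probabilities are invented for illustration rather than learned parameters, and Viterbi decoding is used only because it is a familiar way to segment a sequence (the text above does not say which decoding ChromHMM applies).

```python
import math


def log_emission(observed, mark_probs):
    """Log-probability of a binary mark vector under one state,
    modeled as a product of independent Bernoulli variables."""
    return sum(
        math.log(p if present else 1.0 - p)
        for present, p in zip(observed, mark_probs)
    )


def viterbi(observations, init, trans, emit):
    """Most likely state path for a sequence of binary mark vectors."""
    n_states = len(init)
    score = [
        math.log(init[s]) + log_emission(observations[0], emit[s])
        for s in range(n_states)
    ]
    backpointers = []
    for obs in observations[1:]:
        prev = score
        score, pointers = [], []
        for s in range(n_states):
            best = max(
                range(n_states),
                key=lambda r: prev[r] + math.log(trans[r][s]),
            )
            pointers.append(best)
            score.append(
                prev[best] + math.log(trans[best][s]) + log_emission(obs, emit[s])
            )
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters. Marks: (H3K4me3, H3K27me3).
# State 0 is promoter-like, state 1 is repressed-like.
emit = [[0.90, 0.05], [0.05, 0.80]]
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]

# Binary calls for five consecutive bins.
observations = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]
path = viterbi(observations, init, trans, emit)
print(path)
```

With these toy numbers the first two bins are assigned to the promoter-like state and the remaining three to the repressed-like state, mirroring the kind of segmentation shown in panel (a) of the figure.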

[Figure 1 panel labels. (a) chr4:103,650,000-103,750,000, 50 kb scale, GM12878 (user ordered); RefSeq genes NFKB1 and MANBA; chromatin states: 1_Active_Promoter, 2_Weak_Promoter, 3_Poised_Promoter, 4_Strong_Enhancer, 5_Strong_Enhancer, 6_Weak_Enhancer, 7_Weak_Enhancer, 8_Insulator, 9_Txn_Transition, 10_Txn_Elongation, 11_Weak_Txn, 12_Repressed, 13_Heterochrom/lo, 14_Repetitive/CNV, 15_Repetitive/CNV. (b) Emission parameters, states 1-15 (user order) by mark: CTCF, H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, WCE; transition parameters, state from (user order) by state to (user order). (c) GM12878 fold enrichments, states 1-15 by category: Genome (%), RefSeq TSS, CpG island, RefSeq TSS 2 kb, RefSeq exon, RefSeq gene, RefSeq TES, Conserved, Lamina.]

ChromHMM — Ernst and Kellis, Nature Methods 2012


Page 11: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Cancer genomics workflow

Page 12: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Each type of cancer is different


divergence time. The number of mutations has been measured in tumors representing progressive stages of colorectal and pancreatic cancers (11, 16). Applying the evolutionary clock model to these data leads to two unambiguous conclusions: First, it takes decades to develop a full-blown, metastatic cancer. Second, virtually all of the mutations in metastatic lesions were already present in a large number of cells in the primary tumors.

The timing of mutations is relevant to our understanding of metastasis, which is responsible for the death of most patients with cancer. The primary tumor can be surgically removed, but the residual metastatic lesions, often undetectable and widespread, remain and eventually enlarge, compromising the function of the lungs, liver, or other organs. From a genetics perspective, it would seem that there must be mutations that convert a primary cancer to a metastatic one, just as there are mutations that convert a normal cell to a benign tumor, or a benign tumor to a malignant one (Fig. 2). Despite intensive effort, however, consistent genetic alterations that distinguish cancers that metastasize from cancers that have not yet metastasized remain to be identified.

One potential explanation invokes mutations or epigenetic changes that are difficult to identify with current technologies (see section on "dark matter" below). Another explanation is that metastatic lesions have not yet been studied in sufficient detail to identify these genetic alterations, particularly if the mutations are heterogeneous in nature. But another possible explanation is that there are no metastasis genes. A malignant primary tumor can take many years to metastasize, but this process is, in principle, explicable by stochastic processes alone (17, 18). Advanced tumors release millions of cells into the circulation each day, but these cells have short half-lives, and only a minuscule fraction establish metastatic lesions (19). Conceivably, these circulating cells may, in a nondeterministic manner, infrequently and randomly lodge in a capillary bed in an organ that provides a favorable microenvironment for growth. The bigger the primary tumor mass, the more likely that this process will occur. In this scenario, the continual evolution of the primary tumor would reflect local selective advantages rather than future selective advantages. The idea that growth at metastatic sites is not dependent on additional genetic alterations is also supported by recent results showing that even normal cells, when placed in suitable environments such as lymph nodes, can grow into organoids, complete with a functioning vasculature (20).

Other Types of Genetic Alterations in Tumors

Though the rate of point mutations in tumors is similar to that of normal cells, the rate of chromosomal changes in cancer is elevated (21). Therefore, most solid tumors display widespread changes in chromosome number (aneuploidy), as well as deletions, inversions, translocations,

[Fig. 1B plot, recovered labels. Y-axis: non-synonymous mutations per tumor (median +/- one quartile), ticks 0 to 250 in steps of 25, then 500, 1000, 1500. Group labels: Mutagens; Adult solid tumors; Liquid; Pediatric. Tumor types in plotted order: Colorectal (MSI), Lung (SCLC), Lung (NSCLC), Melanoma, Esophageal (ESCC), Non-Hodgkin lymphoma, Colorectal (MSS), Head and neck, Esophageal (EAC), Gastric, Endometrial (endometrioid), Pancreatic adenocarcinoma, Ovarian (high-grade serous), Prostate, Hepatocellular, Glioblastoma, Breast, Endometrial (serous), Lung (never smoked NSCLC), Chronic lymphocytic leukemia, Acute myeloid leukemia, Glioblastoma, Neuroblastoma, Acute lymphoblastic leukemia, Medulloblastoma, Rhabdoid.]

Fig. 1. Number of somatic mutations in representative human cancers, detected by genome-wide sequencing studies. (A) The genomes of a diverse group of adult (right) and pediatric (left) cancers have been analyzed. Numbers in parentheses indicate the median number of nonsynonymous mutations per tumor. (B) The median number of nonsynonymous mutations per tumor in a variety of tumor types. Horizontal bars indicate the 25 and 75% quartiles. MSI, microsatellite instability; SCLC, small cell lung cancers; NSCLC, non–small cell lung cancers; ESCC, esophageal squamous cell carcinomas; MSS, microsatellite stable; EAC, esophageal adenocarcinomas. The published data on which this figure is based are provided in table S1C.

www.sciencemag.org SCIENCE VOL 339 29 MARCH 2013 1547

SPECIAL SECTION

CREDIT: FIG. 1A, E. COOK

Number of nonsynonymous mutations in representative human cancers, detected by genome-wide sequencing studies.

Vogelstein et al. Science 2013

Page 13: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Each individual tumor is different

▪ Data from TCGA's analyses show that most cancer types have a great number of mutations that occur at a low frequency.

▪ Long-tail distribution


www.nature.com/nature

doi: 10.1038/nature08645 SUPPLEMENTARY INFORMATION

SI Guide

Supplementary Figure 1 Haploid physical coverage of breast cancer samples. Physical coverage indicates the number of DNA fragments of which both ends have been sequenced that on average overlie any position in the genome.

Supplementary Figure 2 Genome wide circos plots of somatic rearrangements in all 24 breast cancers in the study.

Supplementary Figure 3 NFIA-EHF, an expressed, in frame fusion gene caused by an interchromosomal rearrangement in breast cancer cell line HCC1937. (a) Across rearrangement PCR to confirm the presence of the somatic rearrangement in the cancer but not in normal DNA; (b) RT-PCR of RNA between NFIA exon 2 and EHF exon 5 to confirm the presence of a chimeric expressed transcript; (c) Representative picture of dual colour FISH confirming a translocation in HCC1937. Red probe corresponds to BAC RP11-364M11, chromosome 1: 61,064,196-61,228,554. Green probe corresponds to BAC RP11-277N08, chromosome 11: 34,772,104-34,965,946. (d) Schematic diagram of the protein domains fused in the predicted NFIA/EHF fusion protein. Domains from NFIA are blue, domains from EHF are red. (e) Sequence from RT-PCR product shown in (b) confirming NFIA exon 2 fused to EHF exon 5.

Supplementary Figure 4 SLC26A6-PRKAR2A, an expressed, in-frame fusion gene generated by a tandem duplication in the breast cancer cell line HCC38. (a) Across rearrangement PCR to confirm the presence of the somatic rearrangement in the cancer but not in normal DNA; (b) RT-PCR of RNA between SLC26A6 exon 17 and PRKAR2A exon 4 to confirm the presence of a chimeric expressed transcript; (c) Dual colour FISH confirming the 3p21.31 tandem duplication in HCC38. Green-labelled BAC RP11-148G20 is within the tandem duplication. Red-labelled BAC RP11-527M10 is located ~3 Mb telomeric of the tandem duplication. (d) Schematic diagram of the protein domains in the predicted SLC26A6-PRKAR2A fusion protein. Domains from SLC26A6 are blue, domains from PRKAR2A are red. (e) Sequence from RT-PCR product shown in (b) confirming SLC26A6 exon 17 fused to PRKAR2A exon 4.

Supplementary Figure 5 Prevalence of architectures of rearrangements in all 24 breast cancers in the study: Deletion (dark blue), tandem duplication (red), inverted orientation (green), interchromosomal (light blue), breakpoint(s) within an amplicon (orange).

Stephens et al. Nature 2009

Page 14: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Supervised learning / Unsupervised learning

genes

samples

Analyzing gene expression data

Page 15: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

How to deal with high dimension? Identify the most important genes

▪ d is the damping factor, a parameter representing the extent to which the ranking depends on the structure of the graph.

▪ f is the prior probability of the gene, which we set to the absolute differential expression.

▪ deg_i is the in-degree of gene i


Gene Network

Gene Expression

Somatic Alteration Data (SNP, CNV, etc.)

Ranks of Genes

▪ A ranking framework based on PageRank that considers the impact of genes in the network

▪ Impact includes connectivity and the extent to which downstream genes are differentially expressed

▪ Dynamic damping factor is used to improve the original PageRank in ranking genes

DawnRank

Personalized Driver Alterations

$$r_j^{t+1} = (1 - d_j)\, f_j + d_j \sum_{i=1}^{N} \frac{A_{ji}\, r_i^{t}}{\deg_i}, \qquad \deg_i = \sum_{j=1}^{N} A_{ji}$$

Hou and Ma, Genome Med 2014
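The update rule on this slide can be iterated to a fixed point with a short loop. The sketch below is an illustration of that recurrence only, not the DawnRank package: the function name, the convergence tolerance, and the convention that `adj[j][i] == 1` means gene j points to gene i are assumptions, and the per-gene damping values are passed in as plain numbers because the slide does not give the formula for the dynamic damping factor.

```python
def dawnrank_iterate(adj, f, d, iters=100, tol=1e-9):
    """Iterate r_j <- (1 - d_j) * f_j + d_j * sum_i A_ji * r_i / deg_i.

    adj[j][i] == 1 is assumed to mean gene j points to gene i, so that
    deg_i = sum_j A_ji is the in-degree of gene i.
    f: prior per gene; d: damping factor per gene (hypothetical inputs).
    """
    n = len(f)
    deg = [sum(adj[j][i] for j in range(n)) for i in range(n)]
    r = list(f)
    for _ in range(iters):
        new = []
        for j in range(n):
            # Rank flows to j from the genes it points to, split by in-degree.
            flow = sum(adj[j][i] * r[i] / deg[i] for i in range(n) if deg[i] > 0)
            new.append((1 - d[j]) * f[j] + d[j] * flow)
        if max(abs(a - b) for a, b in zip(new, r)) < tol:
            return new
        r = new
    return r

# Toy chain: gene 0 -> gene 1 -> gene 2, with made-up priors and damping.
toy_adj = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
toy_ranks = dawnrank_iterate(toy_adj, f=[0.2, 0.3, 0.5], d=[0.5, 0.5, 0.5])
```

In the toy chain, the upstream gene 0 ends with a rank above its prior because it inherits part of the rank of the genes downstream of it, which is the behaviour the slide describes as the impact of a gene on the network.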

Page 16: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Tumor heterogeneity vs. gene networks


NCIS - Liu et al. BMC Bioinfo 2014
C3 - Hou et al. Bioinformatics 2016
LDGM - Tian, Gu, and Ma, Nucleic Acids Res 2016

[Figure 5 panels, recovered labels. (A) Degree of ESR1 versus # interactions. (B) Rank of ESR1 versus # interactions. Methods compared: LDGM, Glasso, JGL, CNJGL. (C) Differential network with legend entries "% degree from Luminal A" and "% degree from Basal-like"; nodes: NRAS, GABBR1, ATF2, MAPK1, PRKACA, GNAI2, PRKACB, CREB3L4, ADCY2, KCNJ3, PLCB4, GRB2, GNAI3, SRC, PIK3CD, CALML6, ESR1, GABBR2, ADCY4, FOS, ADCY3, NOS3, PLCB2, OPRM1, AKT1, GNAS, CREB3, PIK3CA, HRAS, PLCB3, KCNJ6, CREB3L1, GNAO1, SHC1, MAP2K1, PIK3R5, ADCY5, MAPK3, PLCB1, PIK3R3, SOS1, GNAI1, CALML3, MMP2, PRKACG, PRKCD, CREB3L2, HBEGF, SHC4, PIK3CB, AKT3, CREB5, GRM1, ADCY1, MMP9, EGFR, JUN, ADCY7, ATF6B, SHC2, PIK3R1, CALM1, SOS2, ADCY9, ATF4, PIK3R2, SHC3, SP1.]

Figure 5: Differential networks on estrogen signaling pathway reconstructed based on gene expression data from breast cancer Luminal A and Basal-like subtypes. (A) The degree of ESR1 in estimated differential networks with increased number of interactions. (B) The rank of ESR1 by its degree in differential networks. The number of interactions is up to 1,000 in (A) and (B). (C) A differential network $\hat{\Delta}$ estimated by LDGM with $\lambda = 0.362$. Node size is proportional to the node's degree. Width of an interaction $i - j$ is proportional to the score $|\hat{\Delta}_{ij}|$. The origin of interactions in the differential network is inferred by a principle of majority approach based on Glasso (see Supplementary Text).


produced by C3 for small k: the IQR (interquartile range) value is roughly 0.35 for $k = 5$. The most likely reason behind this result is that our test weights were chosen to boost the relevance of mutual-exclusivity and biological significance rather than coverage. Mutual exclusivity accounts for 100% of the negative weights of edges, while coverage accounts for only 16.7% of the positive weights. We justify this weight choice by the fact that it leads to multiple significant cluster discovery and with our assumption that coverage is a less significant driver property compared to mutual exclusivity. We also point out that it appears that a biologically more relevant coverage constraint is pathway coverage, rather than patient sample coverage.

As already mentioned in the previous sections, one advantage of C3 is that the user can adjust the weights according to her/his own belief about the significance of patient coverage. For example, by changing the averaging weights in our GBM run to $w_1 = 0.60$ (coverage), $w_2 = 0.20$ (network) and $w_3 = 0.20$ (expression), we obtain a coverage percentage of 0.7903 for $k = 5$. However, this excellent coverage comes at a cost of a less significant mutual exclusivity score (fractional value 0.4288) and a lower proportion of detected drivers (fractional value 0.1267). As may be seen from the above example, C3 can be adapted to the user's specification to best reflect the scope and preferences of the analysis.

Another setting in which we analyzed C3 and CoMEt involves pairwise distances of drivers in the network (see Fig. 3D). Here, we calculated the average pairwise distance between all pairs of genes clustered together. We then used Student's t-test to determine the statistical significance of this value. We also compared the values for both algorithms based on 1000 randomly selected genes by using a permutation test. For BRCA, we found no significant performance difference between the two methods in terms of the average pairwise distance: 3.110 for C3 and 3.070 for CoMEt, with a P-value of 0.9330. In GBM, C3 showed a smaller average pairwise distance of 2.908 compared to CoMEt's 3.097. This difference is statistically significant, with a P-value of 0.0379. The small average network distance results of C3 for GBM, coupled with the low coverage, leads to the conclusion that C3 favors niche, exclusive clusters in biologically relevant cancer pathways. Hence, the method may be useful for discovering specific molecular cancer subtypes. Both methods had an average pairwise distance well below the permutation benchmark of 3.903: the P-values of both C3 and CoMEt were less than $2 \times 10^{-16}$ for both cancers.

In conclusion, from our detailed evaluation we conclude that although C3 does not simultaneously outperform CoMEt with respect to all four evaluation criteria, but only three of them (which already represents a significant advantage), the C3 performance indicates a strong overall propensity to select biologically more relevant and more mutually exclusive clusters, with a higher degree of flexibility compared to CoMEt.
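The network-distance evaluation described in this excerpt (average pairwise distance between genes in a cluster, compared against randomly selected genes) can be sketched as follows. This is an illustrative reading of the excerpt, not the authors' evaluation code: the helper names, the unweighted breadth-first-search distance, the empirical p-value formula, and the toy network with placeholder gene names `g1` to `g5` are all assumptions.

```python
import random
from collections import deque
from itertools import combinations

def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_pairwise_distance(graph, genes):
    """Mean shortest-path length over all connected pairs within `genes`."""
    total, count = 0, 0
    for a, b in combinations(genes, 2):
        d = bfs_distances(graph, a).get(b)
        if d is not None:
            total += d
            count += 1
    return total / count if count else float("inf")

def permutation_p_value(graph, genes, n_perm=1000, seed=0):
    """Fraction of random gene sets of the same size that are at least as
    tightly connected as the observed cluster (with a +1 correction)."""
    rng = random.Random(seed)
    observed = average_pairwise_distance(graph, genes)
    nodes = sorted(graph)
    hits = sum(
        average_pairwise_distance(graph, rng.sample(nodes, len(genes))) <= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Hypothetical five-gene path network, stored as adjacency lists.
toy_network = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3", "g5"],
    "g5": ["g4"],
}
toy_cluster = ["g1", "g2", "g3"]
toy_distance = average_pairwise_distance(toy_network, toy_cluster)
```

A smaller average distance than the permutation benchmark suggests that the clustered genes sit closer together in the network than randomly chosen genes would.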

4.2 Discovering potential driver pathways

We examine next the potential of the C3 algorithm to detect clusters whose genes may be new candidate cancer drivers. We focus our search on clusters that contain biologically significant driver genes and known biological network interactions, and exhibit high mutual exclusivity and coverage. At the same time, we only consider the large cluster size regime, as results in this domain have not been previously reported in the literature and as they offer many new interesting insights. Two examples of our analysis are shown in Figures 4 and 5.

In BRCA, one candidate cluster with several potential novel driver genes is the cluster containing PTEN, HUWE1, CNTNAP2, GRID2, CACNA1B, CYSLTR2, MYH1 depicted in Figure 4. The genes in the candidate cluster are mutually exclusive (P-value = 0.0084). The genome landscape of this cluster is dominated primarily by mutations in PTEN and HUWE1, and secondarily by homozygous deletions in PTEN and CYSLTR2. The most

Fig. 4. A cluster of potential driver genes inferred from BRCA. (A) The alteration landscape of the cluster, with blue representing mutation events, red representing copy number deletions and green representing copy number amplifications. (B) A known subnetwork which contains 6 genes (out of 7) in (A). The more intense the red, the higher the alteration frequency of the gene. Nodes highlighted in black represent driver candidates identified by C3 within a small subnetwork. Edges are depicted in black if there exists a direct interaction between two genes. Green edges represent an interaction that undergoes a protein state change. Purple edges are other interactions

Fig. 5. A cluster of potential driver genes inferred from GBM. (A) The alteration landscape of the cluster, with blue representing mutation events, red representing copy number deletions and green representing copy number amplifications. (B) A known subnetwork which contains 6 genes (out of 10) in (A). The more intense the red, the higher the alteration frequency of the gene. Nodes highlighted in black represent driver candidates identified by C3 within a small subnetwork. Edges are depicted in black if there exists a direct interaction between two genes. Green edges represent an interaction that undergoes a protein state change. Purple edges are other interactions

10 J.P.Hou et al.

Page 17: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Deep learning applications


process (Box 1). Deep learning is now one of the most active fields in machine learning and has been shown to improve performance in image and speech recognition (Hinton et al, 2012; Krizhevsky et al, 2012; Graves et al, 2013; Zeiler & Fergus, 2014; Deng & Togneri, 2015), natural language understanding (Bahdanau et al, 2014; Sutskever et al, 2014; Lipton, 2015; Xiong et al, 2016), and most recently, in computational biology (Eickholt & Cheng, 2013; Dahl et al, 2014; Leung et al, 2014; Sønderby & Winther, 2014; Alipanahi et al, 2015; Wang et al, 2015; Zhou & Troyanskaya, 2015; Kelley et al, 2016).

The potential of deep learning in high-throughput biology is clear: in principle, it allows to better exploit the availability of increasingly large and high-dimensional data sets (e.g. from DNA sequencing, RNA measurements, flow cytometry or automated microscopy) by training complex networks with multiple layers that capture their internal structure (Fig 1C and D). The learned networks discover high-level features, improve performance over traditional models, increase interpretability and provide additional understanding about the structure of the biological data.

In this review, we discuss recent and forthcoming applications of deep learning, with a focus on applications in regulatory genomics and biological image analysis. The goal of this review was not to provide comprehensive background on all technical details, which can be found in the more specialized literature (Bengio, 2012; Bengio et al, 2013; Deng, 2014; Schmidhuber, 2015; Goodfellow et al, 2016). Instead, we aimed to provide practical pointers and the necessary background to get started with deep architectures, review current software solutions and give recommendations for applying them to data. The applications we cover are deliberately broad to illustrate differences and commonalities between approaches; reviews focusing on specific domains can be found elsewhere (Park & Kellis, 2015; Gawehn et al, 2016; Leung et al, 2016; Mamoshina et al, 2016). Finally, we discuss both the potential and possible pitfalls of deep learning and contrast these methods to traditional machine learning and classical statistical analysis approaches.

Deep learning for regulatory genomics

Conventional approaches for regulatory genomics relate sequence variation to changes in molecular traits. One approach is to leverage variation between genetically diverse individuals to map quantitative trait loci (QTL). This principle has been applied to identify regulatory variants that affect gene expression levels (Montgomery et al, 2010; Pickrell et al, 2010), DNA methylation (Gibbs et al, 2010; Bell et al, 2011), histone marks (Grubert et al, 2015; Waszak et al, 2015) and proteome variation (Vincent et al, 2010; Albert et al, 2014; Parts et al, 2014; Battle et al, 2015) (Fig 2A). Better statistical methods have helped to increase the power to detect regulatory QTL (Kang et al, 2008; Stegle et al, 2010; Parts et al, 2011; Rakitsch & Stegle, 2016); however, any mapping approach is intrinsically limited to variation that is present in the training population. Thus, studying the effects of rare mutations in particular requires data sets with very large sample size.

An alternative is to train models that use variation between regions within a genome (Fig 2A). Splitting the sequence into windows centred on the trait of interest gives rise to tens of thousands of training examples for most molecular traits even when using a single individual. Even with large data sets, predicting molecular traits from DNA sequence is challenging due to multiple layers

[Figure 1 panels, recovered labels. (A) Raw data, pre-processing, clean data, feature extraction, features, training, model, evaluation, results. (B) Supervised (x and y): linear regression, logistic regression, random forest, SVM, ...; unsupervised (x only): PCA, factor analysis, clustering, outlier detection, ... (C) Raw data versus discriminative features after feature extraction, with class labels Intron and Exon. (D) Raw DNA sequence data passed through Layer 1 and Layer 2, with labels Intron, Exon, TSS.]

Figure 1. Machine learning and representation learning. (A) The classical machine learning workflow can be broken down into four steps: data pre-processing, feature extraction, model learning and model evaluation. (B) Supervised machine learning methods relate input features x to an output label y, whereas unsupervised methods learn factors about x without observed labels. (C) Raw input data are often high-dimensional and related to the corresponding label in a complicated way, which is challenging for many classical machine learning algorithms (left plot). Alternatively, higher-level features extracted using a deep model may be able to better discriminate between classes (right plot). (D) Deep networks use a hierarchical structure to learn increasingly abstract feature representations from the raw data.

Molecular Systems Biology 12: 878 | 2016 © 2016 The Authors

Molecular Systems Biology Deep learning for computational biology Christof Angermueller et al

Published online: July 29, 2016
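The windowing idea in the excerpt above (splitting one genome into windows centred on positions of interest to obtain many training examples) is commonly paired with a one-hot encoding of the bases. The sketch below shows both steps under stated assumptions: the function names, the `flank` parameter, and the choice to skip windows that run off the sequence ends are illustrative choices, not taken from the review.

```python
def one_hot(seq):
    """Encode a DNA string as a list of (A, C, G, T) indicator tuples.
    Any other symbol (for example N) becomes an all-zero row."""
    table = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    return [table.get(base, (0, 0, 0, 0)) for base in seq.upper()]

def centred_windows(genome, positions, flank):
    """Return one window of length 2 * flank + 1 per position, skipping
    positions whose window would extend past either end of the sequence."""
    windows = []
    for pos in positions:
        start, end = pos - flank, pos + flank + 1
        if start >= 0 and end <= len(genome):
            windows.append(genome[start:end])
    return windows

# Hypothetical toy sequence and positions of interest.
toy_genome = "ACGTACGTAC"
toy_windows = centred_windows(toy_genome, positions=[1, 4, 8], flank=2)
toy_inputs = [one_hot(w) for w in toy_windows]
```

Each encoded window would then serve as one input x, with the measured molecular trait at the window centre as its label y.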


From more traditional machine learning applications to deep learning applications

Angermueller et al., Mol Sys Bio 2016

Page 18: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

DeepBind


832 VOLUME 33 NUMBER 8 AUGUST 2015 NATURE BIOTECHNOLOGY

ANALYSIS

identify cumulative effects of short motifs, and the contribution of each is determined automatically by learning. These values are fed into a nonlinear neural network with weights W, which combines the responses to produce a score (Fig. 2a, Supplementary Fig. 1 and Supplementary Notes, sec. 1.1).

We use deep learning techniques (refs. 13–16) to infer model parameters and to optimize algorithm settings. Our training pipeline (Fig. 2b) alleviates the need for hand tuning, by automatically adjusting many calibration parameters, such as the learning rate, the degree of momentum (ref. 14), the mini-batch size, the strength of parameter regularization, and the dropout probability (ref. 15).
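The automatic calibration described above can be pictured as a random search scored by cross validation; the figure labels in this excerpt mention 3-fold cross validation and 30 random calibrations. The sketch below is a generic stand-in rather than DeepBind's pipeline: every function name is hypothetical, and the "model" is a toy shrunken least-squares slope with a single calibration parameter `lam`, chosen only so the example runs without any dependencies.

```python
import random

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]

def calibrate(data, sample_params, fit, score,
              n_calibrations=30, n_folds=3, seed=0):
    """Random search: draw calibration parameters, score each draw by the
    mean validation score over the folds, and keep the best draw."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(data), n_folds)
    best_params, best_score = None, float("-inf")
    for _ in range(n_calibrations):
        params = sample_params(rng)
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [data[i] for i in range(len(data)) if i not in held]
            valid = [data[i] for i in held_out]
            fold_scores.append(score(fit(train, params), valid))
        mean_score = sum(fold_scores) / n_folds
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy stand-in model: slope through the origin, shrunk by params["lam"].
def fit_slope(train, params):
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + params["lam"])

def neg_mse(slope, valid):
    return -sum((y - slope * x) ** 2 for x, y in valid) / len(valid)

def sample_lam(rng):
    # Log-uniform draw, a common choice for regularization strengths.
    return {"lam": 10 ** rng.uniform(-3, 2)}

toy_data = [(x, 2.0 * x) for x in range(1, 10)]
best_params, best_score = calibrate(toy_data, sample_lam, fit_slope, neg_mse)
```

On this noiseless toy data any shrinkage only hurts, so the search settles on the smallest `lam` it happened to sample; with real data the validation score would trade off fit against regularization.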

To obtain the results reported below, we trained DeepBind models on a combined 12 terabases of sequence data, spanning thousands of public PBM, RNAcompete, ChIP-seq and HT-SELEX experiments. We provide the source code for DeepBind together with an online repository (http://tools.genes.toronto.edu/deepbind/) of 927 DeepBind models representing 538 distinct transcription factors and 194 distinct RBPs, each of which was trained on high-quality data and can be applied to score new sequences using an easily installed executable file with no hardware or software requirements.

Ascertaining DNA sequence specificities

To evaluate DeepBind's ability to characterize DNA-binding protein specificity, we used PBM data from the revised DREAM5 TF-DNA Motif Recognition Challenge by Weirauch et al. (ref. 17). The PBM data represent 86 different mouse transcription factors, each measured using two independent array designs. Both designs contain ~40,000 probes that cover all possible 10-mers, and all nonpalindromic 8-mers, 32 times. Participating teams were asked to train on the probe intensities using one array design and to predict the intensities of the held-out array design, which was not made available to participants.

Weirauch et al. (ref. 17) evaluated 26 algorithms that can be trained on PBM measurements, including FeatureREDUCE (ref. 17), BEEML-PBM (ref. 18), MatrixREDUCE (ref. 19), RankMotif++ (ref. 20) and Seed-and-Wobble (ref. 21). For each individual algorithm, they optimized the data preprocessing steps to attain best test performance. Methods were evaluated using the Pearson correlation between the predicted and actual probe intensities, and values from the area under the receiver operating characteristic (ROC) curve (AUC) computed by setting high-intensity probes as positives and the remaining probes as negatives (ref. 17). To the best of our knowledge, this is the largest independent evaluation of this type. When we tested DeepBind under the same conditions, it outperformed all 26 methods (Fig. 3a). DeepBind also ranked first among 15 teams when we submitted it to the online DREAM5 evaluation script (Supplementary Table 1).
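The two evaluation measures named above can each be computed in a few lines. This is a minimal sketch, not the DREAM5 evaluation script: the function names are hypothetical, and the fixed `cutoff` used to mark high-intensity probes as positives is an assumption, since the excerpt does not say how "high-intensity" was defined.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def label_high_intensity(intensities, cutoff):
    """Mark probes above the (assumed) cutoff as positives, the rest as negatives."""
    return [1 if value > cutoff else 0 for value in intensities]

def auc(scores, labels):
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted scores and measured probe intensities.
toy_predicted = [0.9, 0.8, 0.3, 0.1]
toy_measured = [7.0, 2.0, 6.0, 1.0]
toy_labels = label_high_intensity(toy_measured, cutoff=4.0)
toy_auc = auc(toy_predicted, toy_labels)
```

The pairwise form of the AUC is quadratic in the number of probes; for arrays with tens of thousands of probes a rank-based computation would be the practical choice.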

To assess the ability of DeepBind models trained using in vitro PBM data to predict sequence specificities measured using in vivo ChIP-seq data, we followed the method described by Weirauch et al. (ref. 17). Predicting transcription factor binding in vivo is more difficult because it is affected by other proteins, the chromatin state and the physical accessibility of the binding site. We found that DeepBind also achieves the highest score when applied to the in vivo ChIP-seq data (Fig. 3b and Supplementary Fig. 2). The best method reported in the original evaluation (Team_D, a k-mer-based model) and the best reported in the revised evaluation (FeatureREDUCE, a hybrid PWM/k-mer model) both had reasonable, but not the best, performance on in vivo data, which might be due to overfitting to PBM noise (ref. 17).

[Figure 2 diagram. (a) Current batch of inputs → Convolve (motif detectors M) → Rectify (thresholds b) → Pool → Neural network (weights w) → Outputs; outputs are compared with Targets, and the prediction errors drive the Backprop and Update stages, which apply parameter updates to the current model parameters. (b) 1. Calibrate: evaluate 30 random calibrations by 3-fold cross-validation (average validation AUC) and use the best calibration. 2. Train candidates: 3 attempts using all training data; use the parameters of the best candidate by training AUC. 3. Test final model: predict on test data never seen during calibration or training (test AUC).]

Figure 2 Details of inner workings of DeepBind and its training procedure. (a) Five independent sequences being processed in parallel by a single DeepBind model. The convolve, rectify, pool and neural network stages predict a separate score for each sequence using the current model parameters (Supplementary Notes, sec. 1). During the training phase, the backprop and update stages simultaneously update all motifs, thresholds and network weights of the model to improve prediction accuracy. (b) The calibration, training and testing procedure used throughout (Supplementary Notes, sec. 2).
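The convolve, rectify, pool and neural network stages in panel (a) can be illustrated with a small forward pass. This is a sketch under stated assumptions rather than DeepBind's implementation: it uses max pooling only, replaces the neural network stage with a single linear layer, and all function names and toy parameters are hypothetical. The symbols M, b and w follow the labels in the figure.

```python
def one_hot(seq):
    """Encode a DNA string as rows of 4 numbers in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in "ACGT"] for base in seq]


def convolve(x, motif):
    """Motif scan: slide one motif detector (m x 4 weights) along the sequence."""
    m = len(motif)
    return [
        sum(motif[j][k] * x[i + j][k] for j in range(m) for k in range(4))
        for i in range(len(x) - m + 1)
    ]


def rectify(scan, threshold):
    """Subtract the detector's threshold b and clamp negative responses to zero."""
    return [max(0.0, s - threshold) for s in scan]


def pool(rectified):
    """Max pooling: keep the strongest response anywhere in the sequence."""
    return max(rectified)


def score(seq, motifs, thresholds, weights, bias):
    """Binding score: weighted sum of pooled features (linear layer assumed)."""
    x = one_hot(seq)
    features = [pool(rectify(convolve(x, M), b)) for M, b in zip(motifs, thresholds)]
    return bias + sum(w * f for w, f in zip(weights, features))


# Toy detector that responds to the 3-mer "TGA" (invented for illustration).
toy_motif = one_hot("TGA")
print(score("ACTGAC", [toy_motif], [1.0], [1.0], 0.0))
```

Training would adjust the motif detectors, thresholds and weights jointly by backpropagation, which this sketch leaves out.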

Figure 1 DeepBind’s input data, training procedure and applications. 1. The sequence specificities of DNA- and RNA-binding proteins can now be measured by several types of high-throughput assay, including PBM, SELEX, and ChIP- and CLIP-seq techniques. 2. DeepBind captures these binding specificities from raw sequence data by jointly discovering new sequence motifs along with rules for combining them into a predictive binding score. Graphics processing units (GPUs) are used to automatically train high-quality models, with expert tuning allowed but not required. 3. The resulting DeepBind models can then be used to identify binding sites in test sequences and to score the effects of novel mutations.

Alipanahi et al. Nat Biotech 2015 (Other methods: DeepSEA, Zhou & Troyanskaya, Nat Methods 2015; DanQ, Quang & Xie, Nucleic Acids Res 2016)

Page 19: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Cancer genome


MCF-7 http://www.path.cam.ac.uk/~pawefish/

Page 20: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU


Structural variations (SVs) in cancer genomes

inversion translocation

gain loss duplication

Whole genome sequencing Methods: DELLY, Meerkat, BreakDancer, CREST, CNVnator, CONSERTING, and many others

Page 21: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Aneuploidy — Common feature of cancer cells

MCF-7 http://www.path.cam.ac.uk/~pawefish/

• Allele-specific copy number (ASCN) tools: ABSOLUTE, ASCAT, Patchwork

• SVs can further modify the aneuploid cancer genome into a mixture of genomic segments with an extensive range of CNAs

• We need methods that combine SV and ASCN

• How do SVs interact with ASCNs? How do different SVs interact with each other?

Nature Genetics, Volume 45, Number 10, October 2013, p. 1135 (Analysis)

We then inferred the sequence of SCNA events that led to each copy number profile, using the most parsimonious set of SCNAs that could generate the observed absolute allelic copy numbers (Online Methods and Supplementary Fig. 1a). We determined the lengths, locations and numbers of copies changed for each SCNA and, in many cases, allelic structure (Supplementary Fig. 1b). We identified a total of 202,244 SCNAs, a median of 39 per cancer sample, comprising 6 categories: focal SCNAs that were shorter than the chromosome arm (median of 11 amplifications and 12 deletions per sample); arm-level SCNAs that were chromosome-arm length or longer (median of 3 amplifications and 5 deletions per sample); copy-neutral loss-of-heterozygosity (LOH) events in which one allele was deleted and the other was amplified coextensively (median of 1 per sample); and whole-genome duplications (WGDs; in 37% of cancers). By amplifications and deletions, we refer to copy number gains and losses, respectively, of any length and amplitude.

Estimated purities and ploidies per cancer varied substantially within and across lineages (Fig. 1a). Purity estimates correlated with estimates derived from measurements of leukocyte and lymphocyte contamination using DNA methylation data from the same cancers (Supplementary Fig. 1c) (H.S., L. Yao, T. Tiche Jr., T. Hinoue, C. Kandoth et al., unpublished data) but tended to indicate lower purity, consistent with the presence of non-hematopoietic contaminating normal cells. Average ploidies within lineages mirrored WGD frequencies. The average estimated ploidy within samples that had undergone a single WGD was 3.31 (not 4), suggesting that WGD events are associated with large amounts of genome loss. By contrast, samples that had not undergone WGD had an average estimated ploidy of 1.99.

Compared to the near-diploid cancers within each lineage, cancers with WGD had higher rates of every other type of SCNA (Fig. 1b) and twice the rate of SCNAs overall. Across lineages, overall SCNA rates largely reflected rates of WGD (Supplementary Fig. 1d).

In cancers with WGD, most other SCNAs occurred after WGD (Fig. 1b and Online Methods). The fractions of amplifications and deletions that were estimated to occur before WGD were highly correlated across lineages (R = 0.64; Supplementary Fig. 1e), indicating a consistent estimate for the timing of WGD with respect to other SCNAs. WGD was inferred to occur earliest relative to focal SCNAs among lineages where WGD was common (ovarian, bladder and colorectal cancers) and after most focal SCNAs in lineages in which WGD was least common (glioblastoma and kidney clear-cell carcinoma).

SCNA lengths suggest varied mechanisms of generation

Focal SCNAs for which one boundary is the telomere (telomere bounded) tended to be longer than SCNAs for which both boundaries were internal to the chromosome (median SCNA lengths for telomere-bounded and internal events respectively: amplifications, 19.6 Mb versus 0.9 Mb; deletions, 22.7 Mb versus 0.7 Mb). These differences reflect differences across the entire length distributions of telomere-bounded and internal events. Focal internal SCNAs were observed at frequencies inversely proportional to their lengths (Fig. 2a and Supplementary Fig. 2a,b), as noted previously1. However, telomere-bounded SCNAs tended to follow a superposition of 1/length and uniform length distributions. These distributions were the same whether measuring distance by kilobase, number of array markers or number of genes, indicating that this difference in length does not result from variation in array resolution or gene density across the genome (data not shown). Focal, telomere-bounded SCNAs also accounted for more SCNAs than expected assuming random SCNA locations (12% and 26% of focal amplifications and deletions, respectively; P < 0.0001). Both telomere-bounded and internal SCNAs were more likely to end within the centromere than expected given the centromere’s length (Supplementary Fig. 2c), but differences in their length distributions remained when centromere-bounded events were excluded. Differences between telomere-bounded and internal SCNAs were even more marked for copy-neutral LOH events and displayed no correlation across lineages (Supplementary Fig. 2d).

We detected chromothripsis in 5% of samples, ranging from 0% of head and neck squamous cell carcinomas to 16% of glioblastomas (Fig. 2b and Online Methods). The rate of chromothripsis was not related to overall rates of SCNA (R = 0.13; P = 0.3). As previously reported30, samples with chromothripsis were more likely to have chromothripsis on more than 1 chromosome (14/122 samples with chromothripsis had 2 or 3 such events; P = 0.003).

Many chromothripsis events were concentrated in a few genomic regions, often associated with known driver events (Fig. 2c). In glioblastomas,

[Figure 1 plots. (a) Purity (0 to 1.0) and ploidy (1 to 5+) per lineage (LUAD, LUSC, HNSC, KIRC, BRCA, BLCA, CRC, UCEC, GBM, OV), annotated with the percent of samples with WGD per lineage, plus a sample-count summary (0 to 1,000) for all lineages; legend: near diploid, 1 WGD, 2+ WGD. (b) Arm-level SCNAs per sample (0 to 16) and focal SCNAs per sample (0 to 60), for amplifications and deletions, in near-diploid and WGD samples per lineage and overall; legend: before WGD, after WGD, timing undetermined.]

Figure 1 Distribution of SCNAs across lineages. (a) Sample purity (top) and ploidy (bottom) across lineages (LUAD, lung adenocarcinoma; LUSC, lung squamous cell; HNSC, head and neck squamous cell; KIRC, kidney renal cell; BRCA, breast; BLCA, bladder; CRC, colorectal; UCEC, uterine cervix; GBM, glioblastoma multiforme; OV, ovary). Box plots show the median, first quartile and third quartile of purity in each lineage. Near-diploid samples are designated in purple; cancers that have undergone one or more than one WGD event are designated in green and red, respectively. Summary data for all lineages are indicated on the right. (b) Numbers of arm-level (top) and focal (bottom) amplifications (left) and deletions (right) across lineages. For each lineage, near-diploid samples and those with WGD events are indicated by bars on the left and right, respectively; SCNAs in samples with WGD are resolved according to their timing relative to the WGD event.

Zack et al. Nature Genetics 2013

Page 22: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Goal — Quantify allele-specific SVs


Page 23: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Weaver — algorithm overview

Schema labels: Probabilistic Graphical Model (Markov Random Field); Mappability; GC content; Purity; ASCNG; ASCNS; Timing of SV; Phasing; SV list; BAM file; 1KGP haplotypes; SNP list; Cancer Genome Graph; SNP linkage; SNP LD.

[Figure 2 diagram, panels (A) to (F): hypothetical chromosomes chrA (regions R1 to R10) and chrB (regions R11 to R21) carrying SVs m, n, p, q, s and t (deletion, duplication, intrachromosomal and interchromosomal links); the cancer genome graph; the MRF with cancer nodes Rm, Rn, Rp, Rq, Rs and Rt, node potentials ΘR and ΘC and edge potentials ΨR and ΨC; supernodes such as R(3,4), R(5,6), R(7,9), R(12,14) and R(19,20); a pre-/post-aneuploidy timeline; coverage from read mapping; parameters μ0 = 0, μ1 = 1, b = 10.]

Panel (F), genomic regions (inputs: cov, allele_freq; outputs: CN_1, CN_2):

label  cov  allele_freq  CN_1  CN_2
R1     30   0.33         2     1
R2     40   0.5          2     2
R5     20   0            2     0
R7     10   0            1     0
R12    20   0.5          1     1
R17    10   0            0     1

Panel (F), SVs (inputs: L_pos, R_pos; outputs: L_allele, R_allele, CN):

label  L_pos  R_pos  L_allele  R_allele  CN
m      +R2    -R2    2         2         1
n      +R6    -R10   1         1         1
p      +R4    -R16   2         2         1
q      -R12   -R21   2         2         1
s      +R14   +R18   2         2         1
t      +R16   -R18   1         1         2

Figure 2: Illustration of the MRF model used in Weaver. (A) Hypothetical cancer chromosomes with the information of SVs and CNAs hidden. Orange and blue segments represent paternal/maternal alleles. Red dashed lines represent linkages by SVs. (B) The cancer genome graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference adjacencies (solid lines) or cancer adjacencies (dashed lines). (C) MRF representation. Red boxes represent cancer nodes Rc that have included SV information; green boxes are the same as in (B) and represent genome nodes R; the lines between genome nodes are genome edges Er; the lines between cancer nodes and genome nodes are cancer edges (Ec). (D) Primary inputs and outputs are illustrated for each genome node and cancer node. Inputs are presented in the form of [coverage, allele frequency], and outputs are presented with colored circles, with color showing the allele and occurrence reflecting the ASCNG. For example, the observed coverage and allele frequency are 30 and 0.33 for genome node R1 and its output shows that there are two copies of the orange allele and one copy of the blue allele for genome node R1. For the cancer node Rm, the output (one blue circle) indicates that the duplication is on the blue allele with one copy. (E) Blue boxes represent supernodes formed by merging blue shaded chains of genome nodes as shown in (C). (F) Input and output of the MRF are separated into genomic regions and SVs. For region R1, the input is observed with coverage 30 and allele frequency 0.33; the output has two copies on allele 1 and one copy on allele 2. n is a post-aneuploid deletion with one copy and both breakpoints are on allele 1 of chrA. t is a pre-aneuploid deletion with two copies and both breakpoints are on allele 1 of chrB. SVs m, p, q and s are from the allele that has not been duplicated.
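The input/output pairs for genomic regions in panel (F) can be checked with a small worked example. The interpretation used here is an assumption, not something the caption states: the figure's `b = 10` is read as the coverage contributed by one copy in a pure tumor sample, and the allele frequency is read as the minor-allele fraction. Under those two assumptions the six region rows of the figure are reproduced exactly; the function and table names are hypothetical.

```python
B = 10  # ASSUMPTION: read coverage contributed by one copy (the figure's "b = 10")


def expected_observation(cn_1, cn_2, per_copy=B):
    """Return (coverage, allele frequency) expected for allele copy numbers.

    ASSUMPTIONS: purity is 1, coverage scales linearly with total copy number,
    and allele frequency is the minor-allele fraction rounded to 2 decimals.
    """
    total = cn_1 + cn_2
    coverage = per_copy * total
    allele_freq = min(cn_1, cn_2) / total if total else 0.0
    return coverage, round(allele_freq, 2)


# Rows of panel (F): region -> ((CN_1, CN_2), (cov, allele_freq)).
PANEL_F = {
    "R1": ((2, 1), (30, 0.33)),
    "R2": ((2, 2), (40, 0.5)),
    "R5": ((2, 0), (20, 0)),
    "R7": ((1, 0), (10, 0)),
    "R12": ((1, 1), (20, 0.5)),
    "R17": ((0, 1), (10, 0)),
}

for region, (copy_numbers, observed) in PANEL_F.items():
    print(region, expected_observation(*copy_numbers) == observed)
```

Weaver solves the inverse problem (from noisy coverage and allele counts to copy numbers) jointly across regions and SVs, so this forward calculation only shows why the figure's rows are mutually consistent.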


Li et al. Cell Systems 2016

Page 24: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

[Figure 1 diagram. (A) Weaver schema with the labels: Probabilistic Graphical Model (Markov Random Field); Mappability; GC content; Purity; ASCNG; ASCNS; Timing of SV; Phasing; SV list; BAM file; 1KGP haplotypes; SNP list; Cancer Genome Graph; SNP linkage; SNP LD. (B) Coverage track (0 to 142) on chr9 from about 21,850,000 to 22,250,000 (100 kb scale bar), spanning MTAP, C9orf53, CDKN2A, CDKN2B-AS1 and CDKN2B, with deletions Del1 and Del2 and a timeline: LOH and first amplification, deletion, second amplification.]

Figure 1: (A) Schema diagram for Weaver. Dark green boxes show the different types of analyses unique to Weaver that are not dealt with by other methods, while light green ones show ‘by-products’ of Weaver shown to have an improvement over existing methods. (B) An example demonstrating a Weaver output focused on ASCNS and Timing of SV. Dark blue segments (two copies) and light blue segment (one copy) represent a portion of the MCF-7 genome that originated from the same allele on chr9. The other allele was lost during tumorigenesis, resulting in LOH. The predicted evolution of this region based on Weaver’s output is shown at the bottom: the ASCNS of Del1 is 2 and the ASCNS of Del2 is 1; both deletions occurred after the first amplification of the allele and before the second amplification.


Page 25: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

• MRF: genome node, cancer node, genome edge, cancer edge

[Figure 5 diagram, panels (A) to (E): hypothetical cancer chromosomes chrA and chrB, the Cancer Genome Graph, the MRF representation, supernodes, and the input/output tables for genomic regions and SVs.]

Figure 5: (A) Hypothetical cancer chromosomes with rearrangement structure hidden. Orange and blue segments represent paternal/maternal alleles. Red dashed lines represent linkages by SVs. (B) The Cancer Genome Graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference (solid lines) or cancer (dashed lines) adjacencies. (C) MRF representation in Weaver. Red boxes represent cancer nodes (Rc) that have included SV information; green boxes are the same as in (B) and represent genome nodes (R); the lines between genome nodes are genome edges (Er); the lines between cancer nodes and genome nodes are cancer edges (Ec). (D) Blue boxes represent supernodes formed by clustering blue shaded chains of genome nodes as shown in (C). (E) Input and output of the MRF are separated into genomic regions and SVs. For region R1, the input is observed coverage 30 and allele frequency 0.33; the output is 2 copies on allele 1 and 1 copy on allele 2. n is a post-aneuploid deletion with 1 copy and both breakpoints are on allele 1 of chrA. t is a pre-aneuploid deletion with 2 copies and both breakpoints are on allele 1 of chrB. SVs m, p, q and s are from the allele that has not been duplicated.

Panels highlighted on the slide: Cancer Genome Graph (panel B) and MRF representation (panel C).

ONLINE METHODS

The overview of the Weaver algorithm is shown in Fig. 4. The input of Weaver is the BAM file of aligned and unaligned reads from a particular tumor sample. If a matched normal sample is available, it will also be used (details in Section ). The first step is to call variants (including both SNPs and SVs) based on the BAM file. In the Weaver implementation, we designed our own SV calling procedure, with details in Supplementary Note 1 (evaluation results in Table 1). For SNPs, we utilized SAMtools (version 0.1.19) [49]. In order to infer ASCNG, as well as the phasing of SNPs, we only retain SNPs from the original SAMtools SNP list that meet the following criteria: (i) being heterozygous; (ii) not in segmental duplications; (iii) mappability > 0.2; and (iv) reported in the 1000 Genomes Project (1KGP). The SNP linkage calculations will be discussed in Section .
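The four SNP retention criteria above amount to a simple filter. The sketch below assumes each SNP has already been annotated with the relevant flags; the `Snp` record and its field names are hypothetical and are not part of Weaver or SAMtools.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    """Hypothetical pre-annotated SNP record (field names are illustrative)."""
    chrom: str
    pos: int
    heterozygous: bool
    in_segmental_duplication: bool
    mappability: float
    in_1kgp: bool


def keep_snp(snp, min_mappability=0.2):
    """Apply criteria (i) to (iv): heterozygous, outside segmental duplications,
    mappability strictly above the cutoff, and reported in 1KGP."""
    return (
        snp.heterozygous
        and not snp.in_segmental_duplication
        and snp.mappability > min_mappability
        and snp.in_1kgp
    )


candidates = [
    Snp("chr9", 21_900_000, True, False, 0.95, True),
    Snp("chr9", 21_950_000, True, True, 0.95, True),   # in a segmental duplication
    Snp("chr9", 22_000_000, False, False, 0.95, True),  # homozygous
]
retained = [s for s in candidates if keep_snp(s)]
print(len(retained))
```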

Using the intermediate results (yellow boxes in Fig. 4), including the cancer genome graph construction (Section ), the Weaver MRF model will be built. By solving the MRF MAP function (Equation ), Weaver generates output as in the green boxes in Fig. 4 (see Fig. 5(E) for example).

The core modules in Weaver were written in C++. Weaver source code is freely available and can be downloaded from: http://bioen-compbio.bioen.illinois.edu/weaver/.

Genome partitioning and cancer genome graph construction

We first select a default size W (e.g., 5 kb) and partition the genome into non-overlapping regions as follows: (i) breakpoints in SV set C must be on region boundaries; (ii) each region may contain no more than one SNP; (iii) the size of each region must be at most W. The number of regions from the initial segmentation ranges from 1.7 million to 2 million across the datasets analyzed with Weaver, depending on the size of loss of heterozygosity (LOH) regions and the number of SVs. The rationale behind segmenting with SVs is that ASCNG boundaries usually coincide with SV breakpoints [14]. Segmenting at SV boundaries has the advantage of providing base-level ASCNG boundaries, compared with existing genome segmentation methods in copy number analysis, which typically use a fixed segmentation size.

Given the segmentation of the genome and SV set C, we then build the Cancer Genome Graph $G := \{R, E\}$ (Fig. 5(B)), with nodes representing genomic region sets ($R$) and edges representing reference adjacencies ($E_r$; solid lines in the figure) if two nodes are adjacent in the normal genome, and cancer adjacencies ($E_c$; dashed lines in the figure) if two nodes are adjacent in the cancer genome through linkage by an SV $c$. Edge configurations $E$ between nodes $R_i$ and $R_j$ can be represented as $(\Gamma_i R_i \sim \Gamma_j R_j)$, $\Gamma \in \{+, -\}$, with $+$ and $-$ representing the tail (right) and head (left) of a given genomic region $R$; e.g., $(+R_i \sim -R_{i+1}) \in E_r$ if $R_i$ and $R_{i+1}$ are adjacent regions from the same chromosome in the normal genome.
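The partitioning constraints and the reference adjacencies can be sketched for a single chromosome. This is a greedy illustration under stated assumptions, not Weaver's C++ implementation: coordinates are 0-based half-open, regions are treated as at most W long, and the function names and toy positions are hypothetical.

```python
from bisect import bisect_left


def partition(length, breakpoints, snps, w):
    """Split [0, length) into regions so that SV breakpoints fall on region
    boundaries, each region holds at most one SNP, and no region exceeds w."""
    cuts = sorted({0, length, *(b for b in breakpoints if 0 < b < length)})
    snps = sorted(set(snps))
    regions = []
    for seg_start, seg_end in zip(cuts, cuts[1:]):
        start = seg_start
        while start < seg_end:
            end = min(start + w, seg_end)
            first = bisect_left(snps, start)  # first SNP at or after start
            if first + 1 < len(snps) and snps[first + 1] < end:
                end = snps[first + 1]  # cut just before the second SNP
            regions.append((start, end))
            start = end
    return regions


def reference_edges(regions):
    """Reference adjacencies: tail (+) of region i joins head (-) of region i+1."""
    return [(("+", i), ("-", i + 1)) for i in range(len(regions) - 1)]


regions = partition(length=20_000, breakpoints=[7_000], snps=[1_000, 1_200, 12_000], w=5_000)
print(regions)
print(reference_edges(regions)[:2])
```

Cancer adjacencies would be added the same way, one edge per SV, joining the signed ends of the two regions that the SV links.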

We then convert the original Cancer Genome Graph $G := \{R, E\}$ into a Markov Random Field (MRF, $M := \{R, R_c, E_r, E_c\}$), which is a widely used probabilistic graphical model to estimate joint probabilities. The MRF can be viewed as an undirected graph, and the aggregated inference problem in Weaver given sequencing data can be viewed as a maximum a posteriori (MAP) problem with hidden states and observations explained in the following sections. Unlike conventional methods for estimating copy number changes based on hidden Markov models (HMMs), which are designed for sequential data and only consider the dependencies between ‘local’ variables, the MAP solution of the MRF model provides the most probable configuration of aneuploid cancer genomes with complex SVs, involving ‘global’ variable dependencies defined by long-range SVs. The detailed steps are described in Supplementary Note 6. In the following sections, we describe the hidden states, observations, and formal function of the MRF MAP problem. Details on potential functions on nodes and edges are provided in the Supplementary Note.

Hidden states H

For the ith genome node $R_i \in R \subset M$, the hidden states are $H_i = \{C_i^a, C_i^b, G_i^a, G_i^b\}$, where $C_i^a = \{C_{i,0}^a, \ldots, C_{i,K}^a\}$ and $C_i^b = \{C_{i,0}^b, \ldots, C_{i,K}^b\}$ are vectors of non-negative integers representing copy numbers (CNs) for alleles a and b of the kth population on $R_i$, respectively. When k = 0, it stands for the fraction of normal cells. Note that although the Weaver algorithm is generic and in principle can be applied to multiple subclones (K > 1), in our current implementation, Weaver only processes tumor samples without significant subclonal structure (i.e., K = 1). We leave the cases for K > 1 as future work. $G_i^a$ and $G_i^b$ represent the genotypes of alleles a and b of $R_i$, which are independent of subclone structure since only germline SNPs are considered. For convenience, we also set the variable $C_{i,k}$ as the overall CN of the kth population on $R_i$ ($C_{i,k} = C_{i,k}^a + C_{i,k}^b$). In our analysis of cancer genomes, which typically have highly amplified regions, we do not set a limit on $C_{i,k}$, in contrast to previous CNV methods. The hidden CN is bounded by the observed sequencing depth on each region. Note that for regions with low mappability or extreme GC content, it is not reliable to infer the hidden state space from the observed local sequencing coverage; instead, we search for the closest region and inherit its hidden state space setting, assuming that there is no dramatic state change between them.

The hidden states on cancer nodes $R_c$ are discussed in Supplementary Method 4.

Observations O

For observation on $R \subset M$, on the ith genomic region $R_i \in R$, the observation from the hidden state is the raw read coverage $O_i$ on the entire $R_i$, which can be estimated by BEDTools [50] from the BAM file. For a tumor sample with a matched normal genome sequenced, we calculate $O_i^{Norm}$ for the same $R_i$ and normalize $O_i$ using $O_i^{new} = O^{Norm} \times O_i / O_i^{Norm}$, where $O^{Norm}$ is the median coverage for all regions in the normal genome.

If $R_i$ has a SNP, $O_i^a$ and $O_i^b$ are the numbers of reads containing the SNP based on the a and b alleles, respectively, which can be obtained from SNP calling pipelines such as [51]. In practice, neither sequencing nor mapping is uniform across the genome. Here we consider two widely used factors, the GC content and short-read mappability. Using two HapMap samples, NA18507 and NA12878, we split the human genome into consecutive 100 bp bins and calculated the average mapping coverage on each bin. Among the bins that have unexpectedly low or high coverage compared with the rest of the genome, more than 91% have either mappability < 0.6 or GC content < 0.2 or > 0.6. Therefore, we label all $R_i$ as not read-depth informative if mappability < 0.6 or GC content < 0.2 or > 0.6. The read depths of those uninformative regions are inherited from neighboring regions.
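The matched-normal normalization and the read-depth informativeness rule above can be sketched as follows. The function names and toy coverages are hypothetical, and the sketch assumes every region has nonzero coverage in the normal sample.

```python
from statistics import median


def normalize_coverage(tumor, normal):
    """O_i_new = O_Norm * O_i / O_i_Norm, where O_Norm is the median coverage
    over all regions in the normal genome. Assumes no zero entries in normal."""
    o_norm = median(normal)
    return [o_norm * t / n for t, n in zip(tumor, normal)]


def read_depth_informative(mappability, gc_content):
    """A region is uninformative if mappability < 0.6 or GC content is
    below 0.2 or above 0.6 (thresholds taken from the text above)."""
    return not (mappability < 0.6 or gc_content < 0.2 or gc_content > 0.6)


# Toy per-region coverages (invented for illustration).
tumor = [30, 60, 10]
normal = [20, 40, 20]
print(normalize_coverage(tumor, normal))
print(read_depth_informative(0.9, 0.45), read_depth_informative(0.5, 0.45))
```

In this toy case the second region has twice the tumor coverage of the first, but the normal sample shows the same twofold bias, so both regions normalize to the same value.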

Regarding observation on Er ⇢ M, within two adjacent genomic regions Ri, Ri+1 2 R, there are twoindependent observations for their genotype linkage.

(i) We assume the genotypes on i and i + 1 are Gai /G

bi and Ga

i+1/Gbi+1, respectively. We define the Linkage

Disequilibrium (LD) score for the phasing configuration Gai , G

ai+1/G

bi , G

bi+1 as:

LD(Gai , G

ai+1/G

bi , G

bi+1)

=

NLD(Gai , G

ai+1)⇥NLD(Gb

i , Gbi+1)

NLD(Gai , G

ai+1)⇥NLD(Gb

i , Gbi+1) +NLD(Ga

i , Gbi+1)⇥NLD(Gb

i , Gai+1)

where NLD(Gai , G

ai+1) is the number of phased haplotypes (total number 1092 ⇥ 2 in phase 1) in 1KGP with

genotype (Gai , G

ai+1). Other genotype configurations can be similarly calculated.

(ii) Similarly, we define the read linkage score for the phasing $G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b$ as:

$$RL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \frac{N_{RL}(G_i^a, G_{i+1}^a) + N_{RL}(G_i^b, G_{i+1}^b)}{N_{RL}(R_i, R_{i+1})}$$

where $N_{RL}(R_i, R_{i+1})$ is the total number of reads covering genomic regions $(R_i, R_{i+1})$ and $N_{RL}(G_i^a, G_{i+1}^a)$ is the total number of reads covering $(G_i^a, G_{i+1}^a)$. If there are no reads covering $(R_i, R_{i+1})$, that is, $N_{RL}(R_i, R_{i+1}) = 0$, then $RL = 0$.

Therefore, we define genotype linkage as

$$GL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \log\left(LD(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) \times RL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b)\right)$$
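The read linkage score and the combined genotype linkage can be sketched as below. The function names are hypothetical. Two choices are assumptions not stated in the source: the logarithm is taken as the natural log, and a zero product (for example when RL = 0 because no read spans both regions) maps to negative infinity instead of raising an error.

```python
import math


def rl_score(reads_a_pair, reads_b_pair, reads_spanning):
    """Fraction of reads spanning both regions that support the phasing.

    reads_a_pair / reads_b_pair: reads carrying both a alleles / both b alleles.
    reads_spanning: all reads covering both regions.
    """
    if reads_spanning == 0:
        return 0.0  # defined as 0 in the text when no read covers both regions
    return (reads_a_pair + reads_b_pair) / reads_spanning


def genotype_linkage(ld, rl):
    """GL = log(LD * RL); natural log and -inf for a zero product are assumptions."""
    product = ld * rl
    if product <= 0:
        return float("-inf")
    return math.log(product)
```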

In real data applications, we have found that RL and LD correlate very well. For example, in the MCF-7 analysis, when we chose SNP pairs with 100% RL support as the gold standard, we found AUC = 0.9964 using LD scores.

Markov random field model M

After we convert G into MRF M using the steps in Supplementary Note 6, the MRF MAP problem is given by:

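$$\hat{H} = \arg\max_{H} \left\{ \sum_{i \in R} \Theta_R(O \mid H_i) + \sum_{c \in C} \Theta_C(O \mid H_c) + \sum_{i \in R} \Psi_R(O \mid H_i, H_{i+1}) + \sum_{c \in C} \sum_{i \in N(c)} \Psi_C(H_i, H_c) \right\}$$

The four terms are, in order, the genome node potential function, the cancer node potential function, the genome edge potential function, and the cancer edge potential function.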
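To make the structure of the MAP objective concrete, the toy sketch below scores every joint assignment as a sum of node potentials and edge potentials and keeps the best one. It is only an illustration of the objective's form. The function name and data layout are hypothetical, genome and cancer nodes are treated uniformly, and exhaustive enumeration is exponential in the number of nodes, so it is not the inference procedure used for real genomes.

```python
from itertools import product


def map_assignment(states, node_potential, edge_potential):
    """Brute-force MAP over a small MRF.

    states: dict node -> list of candidate hidden states
    node_potential: dict node -> function(state) -> score
    edge_potential: dict (u, v) -> function(state_u, state_v) -> score
    """
    nodes = list(states)
    best, best_score = None, float("-inf")
    for combo in product(*(states[n] for n in nodes)):
        assignment = dict(zip(nodes, combo))
        # Sum of node potentials plus sum of edge potentials.
        score = sum(node_potential[n](assignment[n]) for n in nodes)
        score += sum(f(assignment[u], assignment[v])
                     for (u, v), f in edge_potential.items())
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Toy example: two adjacent regions with candidate copy numbers 1 or 2.
toy_states = {"R1": [1, 2], "R2": [1, 2]}
toy_nodes = {
    "R1": lambda h: 0.0 if h == 2 else -1.0,   # data favour copy number 2
    "R2": lambda h: 0.0 if h == 1 else -0.5,   # data weakly favour copy number 1
}
toy_edges = {("R1", "R2"): lambda a, b: -2.0 * abs(a - b)}  # smoothness penalty
toy_best, toy_score = map_assignment(toy_states, toy_nodes, toy_edges)
```

In the toy example the edge potential outweighs the weak node preference of R2, so the best assignment gives both regions copy number 2.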

Page 26: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

ASCN and SVs in MCF-7


- 83% of SVs have copy number > 1
- 68% of the regions have imbalanced copy number
- We found 276 SVs after whole-chromosome duplication
- We have used physical mapping to validate the results

Page 27: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

ASCN and SVs in HeLa

- WGS reads obtained from Adey et al., Nature 2013
- ASCNG are 97% consistent with Adey et al. (fosmid sequencing)


Structural variants were identified by clustering discordantly mapped reads from 40-kb and 3-kb mate-pair libraries (Supplementary Fig. 8). Twenty interchromosomal links were identified, including links for marker chromosomes M11 (9q33–11p14) and M14 (13q21–19p13). In addition, 209 HeLa-specific deletions and 8 inversions were found (Supplementary Figs 9 and 11, and Supplementary Table 10). Only two genes that are impacted by HeLa-specific structural rearrangements (Supplementary Table 11) intersected with SCGC (STK11 (ref. 18), FHIT), both of which are recurrently deleted in cervical carcinomas (refs 18, 19).

Conventional whole-genome sequencing fails to resolve haplotype phase, an essential aspect of the description and interpretation of non-haploid genomes, including cancer genomes (ref. 20). Recently, several groups have demonstrated genome-wide measurement of local (ref. 5) or sparse (ref. 21) haplotypes, but these approaches have yet to be applied to aneuploid cancer genomes. To resolve haplotype phase across the HeLa genome, we sequenced pools of fosmid clones (ref. 5). Specifically, we constructed three complex fosmid-clone libraries, and then carried out limiting dilution and shotgun sequencing of 288 fosmid clone pools. In summary, these were estimated to include 518,293 individual non-overlapping clones with a median insert size of 33 kb, for a total physical coverage of 6.3× of the haploid reference genome (Supplementary Fig. 12). The complement of likely inherited heterozygous variants (SNP and indel, n = 1.97 × 10^6) was ascertained by shotgun sequencing and by cross-referencing with calls made by the 1000 Genomes Project, and then re-genotyped using reads from each clone pool. Alleles that were present at distinct heterozygous sites within a given clone were assigned, or 'phased', to the same inherited haplotype, and the unobserved alleles were implicitly phased to the opposite haplotype. When overlapping clones from distinct pools were merged, this resulted in haplotype blocks with an N50 (the contig size above which 50% of the total length of the haplotype assembly is included) of 550 kb containing 90.6% of heterozygous variants that were probably inherited.
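The N50 statistic quoted above can be computed with a short routine. This is a generic sketch of the usual definition (the largest length L such that blocks of length at least L together cover at least half of the total), with a hypothetical function name. It is not taken from the cited work.

```python
def n50(block_lengths):
    """Return the N50 of a collection of block (or contig) lengths."""
    total = sum(block_lengths)
    running = 0
    # Walk from the longest block down until half the total length is covered.
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

For block lengths of 100, 200, 300, and 400 kb the total is 1,000 kb. The two longest blocks together cover 700 kb, which passes the halfway mark, so the N50 is 300 kb.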

Most of the HeLa genome is present at an uneven haplotype ratio (for example, 2:1 in regions in which copy number = 3). We sought to exploit the resulting allelic imbalance to phase consecutive haplotype blocks (Supplementary Fig. 13). We first calculated the cumulative allelic ratio among shotgun reads for the SNVs residing in each haplotype block, which clustered closely with the underlying haplotype ratio. For example, in non-LOH regions with a copy number of 3 that have ratios of 2:1 or 1:2, allelic ratios calculated for each block had distributions centred on 0.32 or 0.65, close to the expected fractions of one-third and two-thirds (Supplementary Fig. 14). Using these ratios, we merged haplotype blocks into scaffolds covering 1.96 Gb or 90.3% of the non-LOH HeLa genome (scaffold N50 of 44.8 megabases (Mb); Supplementary Table 12). The haplotype-resolved scaffolds were then merged with the copy-number map to produce a global, haplotype-resolved copy-number profile of the aneuploid HeLa genome (Fig. 1a, Supplementary Fig. 15 and Supplementary Table 13).
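The idea of using allelic imbalance to orient haplotype blocks can be sketched as follows. This is an illustrative simplification, not the procedure of the cited paper: the function names, the input layout (per-SNV read counts for the two local haplotypes of a block), and the nearest-expected-fraction decision rule are all assumptions.

```python
def expected_fractions(copies_a, copies_b):
    """Expected allelic fractions for a haplotype ratio copies_a:copies_b."""
    total = copies_a + copies_b
    return copies_a / total, copies_b / total


def orient_block(snv_read_counts, major_fraction):
    """Decide whether a block's local haplotype 1 is the major or minor copy.

    snv_read_counts: list of (reads_hap1, reads_hap2), one tuple per SNV.
    major_fraction: expected fraction for the more abundant haplotype.
    """
    hap1 = sum(h1 for h1, _ in snv_read_counts)
    hap2 = sum(h2 for _, h2 in snv_read_counts)
    if hap1 + hap2 == 0:
        return None  # no reads, cannot orient
    # Cumulative allelic ratio over all SNVs in the block.
    ratio = hap1 / (hap1 + hap2)
    minor_fraction = 1.0 - major_fraction
    if abs(ratio - major_fraction) <= abs(ratio - minor_fraction):
        return "major"
    return "minor"
```

For a copy number of 3 with a 2:1 ratio, `expected_fractions(2, 1)` gives two-thirds and one-third. Blocks whose haplotype 1 is labelled "major" would then be joined on the same scaffold haplotype.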

Phasing accuracy was independently confirmed by several methods. First, 99.7% of informative read pairs from 3-kb mate-pair sequencing

[Figure: panel a, copy-number profile by haplotype across chromosomes 1–22 and X, with marker chromosomes, linked positions, and the HPV integration site annotated; panel b, window ratio and copy number plotted against genomic position.]

Figure 1 | Haplotype-resolved copy number of the HeLa cancer cell line genome. a, Copy-number profile of HeLa split by haplotypes. Links denote likely contiguity and tandem duplications. Boxes indicate marker chromosomes identified by copy-number breakpoints (boxes are coloured by haplotype; black, unknown; pink text, uncertain locations; S, links confirmed by mate-pair sequencing). b, Windowed copy-number ratios for HeLa CCL-2 (green and purple, alternating chromosomes) and HeLa S3 (grey), with predicted integer copy number for S3 (black). Notable strain differences are indicated by red arrows (for example, reduced copy over chromosome 18q). The window containing the HPV insertion and rearrangement is at elevated copy in both strains.

Source: Adey et al., Nature, vol. 500, p. 208, 8 August 2013 (Research Letter). ©2013 Macmillan Publishers Limited. All rights reserved.

Page 28: The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Application to TCGA Data

- Inter-chromosomal chromothripsis


Supplementary Figure 14: (A) Overview of the genomic landscape of a TCGA ovarian cancer sample (TCGA-36-1571). (B) Most of the SVs linking chr4 and chr22 have copy number one. (C) Three high-coverage fold-back inversions are observed at the boundaries of a highly amplified region of chr14, indicating that many rounds of BFB cycles occurred. The gene FOXG1 is on the fold-back inversion boundary and is highly amplified.

- Breakage-fusion-bridge amplifications


TCGA-36-1571